Abstract

This paper presents an optimized low-complexity and high-throughput multiple-input multiple-output (MIMO) 
signal detector core for detecting spatially-multiplexed data streams. The core architecture supports various layer 
configurations up to 4, while achieving near-optimal performance, as well as configurable modulation constellations 
up to 256-QAM on each layer. The core is capable of operating as a soft-input soft-output log-likelihood ratio 
(LLR) MIMO detector which can be used in the context of iterative detection and decoding. High area-efficiency 
is achieved via algorithmic and architectural optimizations performed at two levels. First, distance computations and 
slicing operations for an optimal 2-layer maximum a posteriori (MAP) MIMO detector are optimized to eliminate the 
use of multipliers and reduce the overhead of slicing in the presence of soft-input LLRs. We show that distances can 
be easily computed using elementary addition operations, while optimal slicing is done via efficient comparisons with 
soft decision boundaries, resulting in a simple feed-forward pipelined architecture. Second, to support more layers, an 
efficient channel decomposition scheme is presented that reduces the detection of multiple layers into multiple 2-layer 
detection subproblems, which map onto the 2-layer core with a slight modification using a distance accumulation 
stage and a post-LLR processing stage. Various architectures are accordingly developed to achieve a desired detection 
throughput and run-time reconfigurability by time-multiplexing of one or more component cores. The proposed core 
is applied as well to design an optimal multi-user MIMO detector for LTE. The core occupies an area of 1.58 MGE 
and achieves a throughput of 733 Mbps for 256-QAM when synthesized in 90 nm CMOS. 

I. Introduction 

Multiple-input multiple-output (MIMO) systems have become mainstream technology for achieving high spectral
efficiencies in wireless communications standards such as IEEE 802.11ac [1] and the 3GPP Long-Term Evolution
(LTE) [2]. Detection of spatially-multiplexed MIMO streams plays a key role in receiver design, both in terms of
performance and complexity, and remains an active area of research [3]-[7]. The focus has been on
developing area- and energy-efficient VLSI implementations of MIMO detectors that are capable of achieving close to
optimal performance.

M. M. Mansour is with the Department of Electrical and Computer Engineering at the American University of Beirut, Lebanon, e-mail: 
mmansour@ieee.org. L. Jalloul is with Qualcomm Inc., San Jose, CA, e-mail: jalloul@ieee.org. 
A plethora of MIMO detectors have appeared in the literature, offering various performance-complexity
tradeoffs. Suboptimal zero-forcing (ZF) and minimum mean-squared error (MMSE) detectors [5], as well
as nonlinear parallel and successive interference cancellation schemes [8]-[11], require relatively low complexity but
sacrifice performance. On the other hand, tree-search or list-based detectors, such as the well-known sphere
decoding algorithm [12]-[19], require substantially higher complexity but can offer (near-)ML performance. Other
tree-search schemes, such as the K-Best algorithm [20]-[26], address the non-deterministic throughput aspects of sphere
decoders. Practical implementation aspects have been investigated in [18], [23], [25]-[39].

Subspace detection based on channel decomposition offers a good compromise between performance and
complexity (e.g. see [40]-[43]). In these schemes, the effective MIMO channel matrix is decomposed into parallel
subchannels that can be used to detect subsets of streams in parallel. By allowing subspaces to overlap, additional
diversity can be gathered by placing a data stream with low reliability into several detection sets. The LORD algorithm
proposed in [44], [45] can be viewed as a special class of subspace MIMO detectors. It achieves ML performance
(in the max-log-MAP [46] sense) on 2 transmit antennas, but its performance degrades when the number of
antennas increases. In [47], the LORD algorithm was generalized to 4 transmit antennas by using matrix inversion
to decompose the channel into single streams.

Support for ever increasing data rates has come through an increase in the number of supported spatial streams,
or through the use of more bandwidth via carrier aggregation [48]. LTE-Advanced uses up to 8 spatial streams,
or the aggregation of five component carriers for a bandwidth of 100 MHz, which leads to peak rates of
over 1 Gbps. While the receiver complexity to detect 8 spatial layers remains very challenging, especially
for dense constellations, the use of carrier aggregation with distinct or separate physical layers and convergence
at higher layers seems more tractable. Since each physical layer of a component carrier is required to support 2
or 4 spatial layers, the hardware optimization of these MIMO detector cores becomes paramount,
especially if near-ML performance is desired, higher-order modulations such as 256-QAM are to be supported, and
high-throughput processing is a must.

Contributions: We propose in this work an optimized and configurable 2x2 soft-input soft-output maximum a 
posteriori (MAP) MIMO detector, and use it as a basic building block for constructing high-throughput detectors for 
higher-order layers. The key features and advantages of the proposed detector core are: 1) scalability in supporting 
multiple layers, 2) flexibility in accommodating multiple layer-configurations and detection of subsets of layers, 
3) configurability of supported constellations per layer, 4) support for soft-input log-likelihood ratios (LLRs) from 
channel decoder, 5) near-ML performance, and 6) reduced-complexity and high-throughput operation. We develop 
extensive optimizations at both the algorithmic and architectural levels targeted for a 2 x 2 soft-input soft-output 
MAP MIMO detector, as well as its extension to support more spatial layers. In particular, optimizations of 
distance computations (to eliminate multipliers and simplify slicing) are shown to result in substantial reduction in 
computational complexity when supporting constellations up to 256-QAM. Furthermore, the complexity of a 1D
slicer is shown to play a key role in the overall complexity of the detector when soft-input LLRs are supported.
To this end, an efficient slicing scheme based on soft decision boundaries is presented. Moreover, a low-complexity
scheme that decomposes a MIMO channel into multiple subsets of decoupled streams is proposed. It is shown that
decoupled streams can be detected efficiently and in parallel using the optimized 2x2 core. Moreover, the 2x2 
core is applied in the context of multi-user (MU-MIMO) for joint modulation classification and data detection. The 
core has been implemented on an FPGA, and synthesized as well using a generic 90 nm ASIC CMOS library. 

The rest of the paper is organized as follows. After introducing the system model in Section II, Section III 
presents the optimizations targeted for a 2-layer MAP MIMO detector in terms of distance computations and 
slicing. Key equations for distances and soft decision boundaries are derived assuming both zero and non-zero 
input LLRs. Section IV proposes a matrix decomposition scheme to support detection of more spatial streams. 
We show that the key distance equations scale in a straightforward fashion from the 2-layer case, where only
a new distance-accumulation phase and a post-LLR processing phase are needed. In Section V, single and multi-core
detector architectures are developed. The core is applied in Section VI as part of MU-MIMO detection for constellation
estimation and data detection. Synthesis and simulation results are reported in Section VII. Finally, Section VIII
ends with concluding remarks. 

II. System Model 

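Consider a MIMO system with $N$ transmit and $M \ge N$ receive antennas. The equivalent complex baseband
input-output system relation can be modeled as $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$, where $\mathbf{y} \in \mathbb{C}^{M\times 1}$ is the received complex signal vector,
$\mathbf{H} \in \mathbb{C}^{M\times N}$ is the complex channel matrix, $\mathbf{x} = [x_1\ x_2\ \cdots\ x_N]^T \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_N$ is the $N\times 1$ transmitted
complex symbol vector, and $\mathbf{n} \in \mathbb{C}^{M\times 1}$ is a zero-mean complex Gaussian circularly symmetric random noise vector
with covariance $\sigma^2\mathbf{I}_M$. Each symbol $x_n$ belongs to a complex constellation $\mathcal{X}_n$ of size $Q_n = 2^{q_n}$, and is associated
via the map $\mathbf{b}(\cdot)$ with a coded bit-interleaved vector $\mathbf{b}(x_n) = \mathbf{b}_n = [b_{n,1}\ b_{n,2}\ \cdots\ b_{n,q_n}]^T$ of length $q_n$ over the
set $\{-1,+1\}$, where binary 0 maps to $+1$. Let $|\mathcal{X}| = Q = 2^{q}$, and denote the binary vector associated with the
overall symbol vector $\mathbf{x}$ as $\mathbf{b}(\mathbf{x}) = [\mathbf{b}_1; \cdots; \mathbf{b}_N] = [b_{n,j}]$, for $n = 1, \dots, N$, and $j = 1, \dots, q_n$. Motivated by recent
standards, we assume rectangular QAM constellations, where $\mathcal{X}_n = \mathcal{P}_n \times \mathcal{P}_n$, and $\mathcal{P}_n$ is a 1D $P_n$-PAM constellation
with $P_n = \sqrt{Q_n}$.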

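We assume $\mathbf{H}$ is known to the receiver, has full column rank and is decomposed as $\mathbf{H} = \mathbf{Q}\mathbf{L}$, where $\mathbf{Q} \in \mathbb{C}^{M\times N}$
is a unitary matrix and $\mathbf{L} \in \mathbb{C}^{N\times N}$ is a lower triangular matrix (LTM) with positive and real diagonal elements.

Since $\mathbf{Q}$ is unitary, it preserves Euclidean norm as well as noise statistics. Hence we use the transformed relation
$\tilde{\mathbf{y}} = \mathbf{Q}^*\mathbf{y} = \mathbf{L}\mathbf{x} + \mathbf{Q}^*\mathbf{n} \in \mathbb{C}^{N\times 1}$ to model the MIMO system.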

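A hard-decision (HD) maximum a posteriori (MAP) MIMO detector achieves log-max [46] optimal performance
by finding the symbol vector $\mathbf{x}$ in $\mathcal{X}$ that is closest to the received vector $\tilde{\mathbf{y}}$ under the unscaled "distance" metric [16]:

$$d(\mathbf{x}) = \|\tilde{\mathbf{y}} - \mathbf{L}\mathbf{x}\|^2 - \mathbf{b}^T(\mathbf{x})\boldsymbol{\lambda}, \tag{1}$$

where $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1; \cdots; \boldsymbol{\lambda}_N] = [\lambda_{n,j}]$ is a column vector of a priori LLR values $\lambda_{n,j} \in \mathbb{R}$ associated with the bits in
$\mathbf{b}(\mathbf{x})$, assuming these bits are statistically independent:

$$\lambda_{n,j} = \frac{\sigma^2}{2}\,\ln\frac{\mathrm{Prob}(b_{n,j} = +1)}{\mathrm{Prob}(b_{n,j} = -1)}.$$

The subvector $\boldsymbol{\lambda}_n = [\lambda_{n,1}, \cdots, \lambda_{n,q_n}]^T$ is associated with the bits $\mathbf{b}(x_n)$ of the $n$th symbol $x_n$. The hard-decision
MAP solution of the MIMO detection problem is given by¹

$$d^{\mathrm{MAP}} = \min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) \quad\text{and}\quad \mathbf{x}^{\mathrm{MAP}} = \arg\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}). \tag{2}$$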

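For joint iterative MIMO detection and decoding however, soft-input soft-output MIMO detectors are required.
A log-max optimal soft-input soft-output MAP MIMO detector computes $2q$ other minimum distance metrics as
follows:

$$\Lambda^{\mathrm{MAP}}_{n,j} = \min_{\mathbf{x}\in\mathcal{X}^{(+1)}_{n,j}} d(\mathbf{x}) \;-\; \min_{\mathbf{x}\in\mathcal{X}^{(-1)}_{n,j}} d(\mathbf{x}), \tag{3}$$

for $n = 1, \dots, N$ and $j = 1, \dots, q_n$, where $\mathcal{X}^{(+1)}_{n,j} = \{\mathbf{x}\in\mathcal{X} : b_{n,j} = +1\}$ and $\mathcal{X}^{(-1)}_{n,j} = \{\mathbf{x}\in\mathcal{X} : b_{n,j} = -1\}$ are the
subsets of symbol vectors in $\mathcal{X}$ that have their corresponding $j$th bit in the $n$th symbol equal to $+1$ and $-1$, respectively.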
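As a reference point for the optimizations that follow, the exhaustive form of (1)-(3) can be sketched in a few lines of Python. This is only an illustrative sketch for a 2-layer system: the QPSK bit labeling, the numerical values of the triangular matrix and the observation, and the helper names `distance` and `brute_force_llrs` are assumptions introduced here, not taken from the paper.

```python
from itertools import product

# Hypothetical QPSK labeling (assumption): symbol -> (real-part bit, imag-part bit),
# bits in {+1, -1}, with a negative coordinate mapped to bit +1.
QPSK = {complex(re, im): (-re, -im) for re in (-1, 1) for im in (-1, 1)}

def distance(x, y, L, lam):
    """Unscaled metric (1): ||y - L x||^2 - b(x)^T lambda, for two layers."""
    r0 = y[0] - L[0][0] * x[0]
    r1 = y[1] - L[1][0] * x[0] - L[1][1] * x[1]
    bits = QPSK[x[0]] + QPSK[x[1]]
    bias = sum(b * l for b, l in zip(bits, lam))
    return abs(r0) ** 2 + abs(r1) ** 2 - bias

def brute_force_llrs(y, L, lam):
    """Evaluate all Q1*Q2 candidates and apply (2) and (3) directly."""
    cands = [(distance(x, y, L, lam), x, QPSK[x[0]] + QPSK[x[1]])
             for x in product(QPSK, repeat=2)]
    d_map, x_map, _ = min(cands, key=lambda c: c[0])
    llrs = []
    for k in range(4):  # 2 bits per symbol, 2 symbols
        d_plus = min(d for d, _, b in cands if b[k] == +1)
        d_minus = min(d for d, _, b in cands if b[k] == -1)
        llrs.append(d_plus - d_minus)  # sign convention of (3)
    return x_map, d_map, llrs

# Example values (assumptions): lower triangular L with a real positive diagonal.
L = [[1.2, 0.0], [0.3 - 0.4j, 0.9]]
x_true = (complex(1, -1), complex(-1, -1))
y = [L[0][0] * x_true[0], L[1][0] * x_true[0] + L[1][1] * x_true[1]]  # noiseless
x_map, d_map, llrs = brute_force_llrs(y, L, lam=[0.0] * 4)
```

The cost of this reference grows as the product of the constellation sizes, which is what the reductions in Section III are designed to avoid.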


III. Optimized MIMO MAP Detection for 2 Layers 


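Finding the MAP solutions in (2) and (3) requires computing $\prod_{n=1}^{N} Q_n$ distance metrics. When $N = 2$, a
simplification [44] can be applied to reduce the number of computations from $Q_1 \cdot Q_2$ to $Q_1 + Q_2$. Triangularizing
the channel matrix as $\mathbf{H} = \mathbf{Q}\mathbf{L}$ with $\mathbf{Q}$ being unitary, we obtain:

$$\tilde{\mathbf{y}} - \mathbf{L}\mathbf{x} = \begin{bmatrix}\tilde{y}_1\\ \tilde{y}_2\end{bmatrix} - \begin{bmatrix}\alpha & 0\\ \gamma & \beta\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix},$$

where $\tilde{\mathbf{y}} = \mathbf{Q}^*\mathbf{y}$, with $\alpha, \beta \in \mathbb{R}^+$ and $\gamma \in \mathbb{C}$. Then (1) becomes

$$d(\mathbf{x}) = f_1(x_1) + f_2(x_2 \mid x_1), \tag{4}$$

where

$$f_1(x_1) = |\tilde{y}_1 - \alpha x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1, \quad\text{and}$$
$$f_2(x_2 \mid x_1) = |\tilde{y}_2 - \gamma x_1 - \beta x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2.$$

The minimum distance in (2) can then be computed as

$$\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) = \min_{x_1\in\mathcal{X}_1,\, x_2\in\mathcal{X}_2} \{ f_1(x_1) + f_2(x_2 \mid x_1) \}$$
$$= \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \min_{x_2\in\mathcal{X}_2} f_2(x_2 \mid x_1) \Big\} \tag{5}$$
$$= \min_{x_1\in\mathcal{X}_1} \{ f_1(x_1) + f_2(\hat{x}_2(x_1) \mid x_1) \}$$
$$= \min_{x_1\in\mathcal{X}_1} d(x_1, \hat{x}_2(x_1)), \tag{6}$$

where

$$\hat{x}_2(x_1) = \arg\min_{x_2\in\mathcal{X}_2} \big\{ |\tilde{y}_2 - \gamma x_1 - \beta x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2 \big\}. \tag{7}$$

Denote the set of sliced symbol vectors for all $x_1$ in (6) by

$$\mathcal{O}_1 = \big\{ [x_1\ \hat{x}_2(x_1)]^T : x_1 \in \mathcal{X}_1 \big\}. \tag{8}$$

The bit LLRs of symbol $x_1$, for $j = 1, \dots, q_1$, are given by

$$\Lambda^{\mathrm{MAP}}_{1,j} = \min_{x_1\in\mathcal{X}_1:\, b_{1,j}=+1} d(x_1, \hat{x}_2(x_1)) \;-\; \min_{x_1\in\mathcal{X}_1:\, b_{1,j}=-1} d(x_1, \hat{x}_2(x_1)). \tag{9}$$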
¹The quantities $d^{\mathrm{MAP}}$ in (2) and $\Lambda^{\mathrm{MAP}}_{n,j}$ in (3) need to be scaled by a constant that depends on the noise variance $\sigma^2$.

Fig. 1. Gray-coded mapping for 64-QAM in LTE [2].

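To obtain the bit LLRs of $x_2$ however, we triangularize $\mathbf{H}$ as $\mathbf{Q}'\mathbf{L}'$ so that a zero appears in the upper left
corner:

$$\tilde{\mathbf{y}}' - \mathbf{L}'\mathbf{x} = \begin{bmatrix}\tilde{y}'_1\\ \tilde{y}'_2\end{bmatrix} - \begin{bmatrix}0 & \alpha'\\ \beta' & \gamma'\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix},$$

where $\tilde{\mathbf{y}}' = \mathbf{Q}'^*\mathbf{y}$; $\alpha', \beta' \in \mathbb{R}^+$ and $\gamma' \in \mathbb{C}$. Then (1) becomes

$$d(\mathbf{x}) = f'_2(x_2) + f'_1(x_1 \mid x_2), \quad\text{where}$$
$$f'_2(x_2) = |\tilde{y}'_1 - \alpha' x_2|^2 - \mathbf{b}^T(x_2)\boldsymbol{\lambda}_2, \quad\text{and}$$
$$f'_1(x_1 \mid x_2) = |\tilde{y}'_2 - \gamma' x_2 - \beta' x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1,$$

and the minimum distance in (2) can be computed as

$$\min_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x}) = \min_{x_2\in\mathcal{X}_2} \Big\{ f'_2(x_2) + \min_{x_1\in\mathcal{X}_1} f'_1(x_1 \mid x_2) \Big\}$$
$$= \min_{x_2\in\mathcal{X}_2} \{ f'_2(x_2) + f'_1(\hat{x}_1(x_2) \mid x_2) \}$$
$$= \min_{x_2\in\mathcal{X}_2} d(\hat{x}_1(x_2), x_2), \tag{10}$$

where $\hat{x}_1(x_2) = \arg\min_{x_1\in\mathcal{X}_1} f'_1(x_1 \mid x_2)$. Denote the set of sliced symbol vectors for all $x_2$ in (10) by

$$\mathcal{O}_2 = \big\{ [\hat{x}_1(x_2)\ x_2]^T : x_2 \in \mathcal{X}_2 \big\}.$$

The bit LLRs of symbol $x_2$, for $j = 1, \dots, q_2$, are given by

$$\Lambda^{\mathrm{MAP}}_{2,j} = \min_{x_2\in\mathcal{X}_2:\, b_{2,j}=+1} d(\hat{x}_1(x_2), x_2) \;-\; \min_{x_2\in\mathcal{X}_2:\, b_{2,j}=-1} d(\hat{x}_1(x_2), x_2). \tag{11}$$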

Since $\mathbf{Q}$ and $\mathbf{Q}'$ are unitary, the MAP solutions in (6) and (10) are identical. To find the hard-decision (HD)
MAP solution, only a 1-sided QLD is needed on either layer 1 or 2. If $Q_1 < Q_2$, a list of $Q_1$ distances $\mathcal{D}_1 =
\{d(x_1, \hat{x}_2(x_1)) : x_1 \in \mathcal{X}_1\}$ is generated by enumerating all symbols $x_1 \in \mathcal{X}_1$ and the minimum is selected. If
$Q_2 < Q_1$, a list of $Q_2$ distances $\mathcal{D}_2 = \{d(\hat{x}_1(x_2), x_2) : x_2 \in \mathcal{X}_2\}$ is generated and the minimum is selected.
However, to generate soft LLRs, 2-sided QLDs are needed, and both lists of distances must be generated to select
the appropriate minima according to (9) and (11).

A. Distance Metric Optimizations 

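For efficient distance computations, we separate the real and imaginary parts of all complex variables, and exploit
the fact that the real and imaginary parts of each QAM symbol are mapped independently into 1D PAM symbols,
i.e., some bits are used only for mapping of the real part and some only for the imaginary part (see Fig. 1). Note
that this mapping is used in, e.g., the IEEE 802.11ac [1] and LTE [2] standards. Under this assumption, we can
split the bias term $\mathbf{b}^T(x_n)\boldsymbol{\lambda}_n$ into a part $\mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR} = \mathbf{b}^T(x_{nR})\boldsymbol{\lambda}_{nR}$ associated with the bits of the real part of the
QAM symbol, and a part $\mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI} = \mathbf{b}^T(x_{nI})\boldsymbol{\lambda}_{nI}$ associated with the bits of the imaginary part. Let $\gamma = \gamma_R + j\gamma_I$,
$x_n = x_{nR} + jx_{nI}$, $\tilde{y}_n = \tilde{y}_{nR} + j\tilde{y}_{nI}$ for $n = 1, 2$. Then the distance in (4) becomes

$$d(\mathbf{x}) = f_{1R}(x_{1R}) + f_{1I}(x_{1I}) + f_{2R}(x_{2R} \mid x_1) + f_{2I}(x_{2I} \mid x_1), \tag{12}$$

where the terms on the righthand side are given by

$$f_{1R}(x_{1R}) = (\tilde{y}_{1R} - \alpha x_{1R})^2 - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R}$$
$$f_{1I}(x_{1I}) = (\tilde{y}_{1I} - \alpha x_{1I})^2 - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I}$$
$$f_{2R}(x_{2R} \mid x_1) = (\tilde{y}_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I} - \beta x_{2R})^2 - \mathbf{b}^T_{2R}\boldsymbol{\lambda}_{2R}$$
$$f_{2I}(x_{2I} \mid x_1) = (\tilde{y}_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R} - \beta x_{2I})^2 - \mathbf{b}^T_{2I}\boldsymbol{\lambda}_{2I}.$$

Expanding (12), minimizing with respect to $x_{2R}$ and $x_{2I}$, and removing irrelevant terms, we obtain the following
key equation:

$$\tilde{d}(\mathbf{x}) = \tilde{f}_{1R}(x_{1R}) + \tilde{f}_{1I}(x_{1I}) + \min_{x_{2R}\in\mathcal{P}_2} \tilde{f}_{2R}(x_{2R} \mid x_1) + \min_{x_{2I}\in\mathcal{P}_2} \tilde{f}_{2I}(x_{2I} \mid x_1), \tag{13}$$

where $\mathcal{P}_2$ is the 1D PAM constellation in $\mathcal{X}_2$ of layer 2, and

$$\tilde{f}_{1R}(x_{1R}) = A x_{1R}^2 + C x_{1R} - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R} \tag{14}$$
$$\tilde{f}_{1I}(x_{1I}) = A x_{1I}^2 + D x_{1I} - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I} \tag{15}$$
$$\tilde{f}_{2R}(x_{2R} \mid x_1) = (E x_{1R} + F x_{1I})\, x_{2R} + (B x_{2R}^2 + G x_{2R} - \mathbf{b}^T_{2R}\boldsymbol{\lambda}_{2R}) \tag{16}$$
$$\tilde{f}_{2I}(x_{2I} \mid x_1) = (E x_{1I} - F x_{1R})\, x_{2I} + (B x_{2I}^2 + H x_{2I} - \mathbf{b}^T_{2I}\boldsymbol{\lambda}_{2I}). \tag{17}$$

The constant coefficients in (14)-(17) are given by

$$A = \alpha^2 + |\gamma|^2, \quad B = \beta^2, \tag{18}$$

$$C = -2(\alpha\tilde{y}_{1R} + \gamma_R\tilde{y}_{2R} + \gamma_I\tilde{y}_{2I}),$$
$$D = -2(\alpha\tilde{y}_{1I} - \gamma_I\tilde{y}_{2R} + \gamma_R\tilde{y}_{2I}),$$
$$E = +2\beta\gamma_R, \quad F = -2\beta\gamma_I, \quad G = -2\beta\tilde{y}_{2R}, \quad H = -2\beta\tilde{y}_{2I}, \tag{19}$$

and can be precomputed off-line from $\mathbf{H}$ and $\mathbf{y}$. The HD-MAP solution is obtained by populating all $Q_1$ distances
in (13) and selecting the minimum. The same applies for the LLRs.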
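The expansion leading to (14)-(19) can be checked numerically. The sketch below, with zero priors and arbitrarily chosen example values (assumptions, as are the helper names), confirms that the direct distance and the expanded polynomial form differ only by the constant $|\tilde{y}_1|^2 + |\tilde{y}_2|^2$, which is the "irrelevant term" dropped because it does not depend on the candidate symbols.

```python
import random

def constants(alpha, beta, gamma, y1, y2):
    """Constants (18)-(19) from the entries of L and the rotated observation."""
    A = alpha ** 2 + abs(gamma) ** 2
    B = beta ** 2
    C = -2 * (alpha * y1.real + gamma.real * y2.real + gamma.imag * y2.imag)
    D = -2 * (alpha * y1.imag - gamma.imag * y2.real + gamma.real * y2.imag)
    E = 2 * beta * gamma.real
    F = -2 * beta * gamma.imag
    G = -2 * beta * y2.real
    H = -2 * beta * y2.imag
    return A, B, C, D, E, F, G, H

def d_direct(x1, x2, alpha, beta, gamma, y1, y2):
    """Distance (4) with zero priors."""
    return abs(y1 - alpha * x1) ** 2 + abs(y2 - gamma * x1 - beta * x2) ** 2

def d_expanded(x1, x2, consts):
    """Sum of the four terms (14)-(17) with zero priors, for a fixed (x1, x2)."""
    A, B, C, D, E, F, G, H = consts
    x1r, x1i, x2r, x2i = x1.real, x1.imag, x2.real, x2.imag
    return (A * x1r ** 2 + C * x1r + A * x1i ** 2 + D * x1i
            + (E * x1r + F * x1i) * x2r + B * x2r ** 2 + G * x2r
            + (E * x1i - F * x1r) * x2i + B * x2i ** 2 + H * x2i)

random.seed(0)
alpha, beta = 1.3, 0.8                 # example values (assumptions)
gamma = complex(0.4, -0.7)
y1, y2 = complex(0.2, -1.1), complex(-0.5, 0.9)
consts = constants(alpha, beta, gamma, y1, y2)
pam = (-3, -1, 1, 3)                   # 16-QAM uses 4-PAM on each axis
offsets = []
for _ in range(50):
    x1 = complex(random.choice(pam), random.choice(pam))
    x2 = complex(random.choice(pam), random.choice(pam))
    offsets.append(d_direct(x1, x2, alpha, beta, gamma, y1, y2)
                   - d_expanded(x1, x2, consts))
dropped = abs(y1) ** 2 + abs(y2) ** 2  # candidate-independent constant
```

Because the offset is the same for every candidate, minimizing the expanded form selects the same symbols as minimizing the original distance.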



IEEE TRANSACTIONS ON SIGNAL PROCESSING, DRAFT (16/06/2015) 




B. Slicing Assuming Zero Prior LLRs 

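Assuming the input LLRs $\boldsymbol{\lambda}$ are zero, the rightmost term in (1) vanishes and the MAP detection problem reduces
to a least-squares integer ML problem. Then $\hat{x}_2$ in (7) can be obtained by slicing $(\tilde{y}_2 - \gamma x_1)/\beta$ to the nearest
constellation point in $\mathcal{X}_2$ using the operator $\lfloor u \rceil_{\mathcal{X}_n} = \arg\min_{x\in\mathcal{X}_n} |u - x|$:

$$\hat{x}_2 = \lfloor (\tilde{y}_2 - \gamma x_1)/\beta \rceil_{\mathcal{X}_2} \in \mathcal{X}_2. \tag{20}$$

By separating the real and imaginary parts as $\hat{x}_2 = \hat{x}_{2R} + j\hat{x}_{2I}$, the slicing operation in (20) splits into:

$$\hat{x}_{2R} = \lfloor (\tilde{y}_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I})/\beta \rceil_{\mathcal{P}_2} \in \mathcal{P}_2, \tag{21}$$
$$\hat{x}_{2I} = \lfloor (\tilde{y}_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R})/\beta \rceil_{\mathcal{P}_2} \in \mathcal{P}_2, \tag{22}$$

where $\mathcal{P}_2 = \{p_1, p_2, \dots, p_{P_2}\}$ is the $P_2$-PAM constellation, and $P_2 = \sqrt{Q_2}$. The operations in (21)-(22) reduce to
simple comparisons with the (deterministic) decision boundaries of $\mathcal{P}_2$ as follows. Let $z_2 = \tilde{y}_2 - \gamma x_1 = z_{2R} + jz_{2I}$,
where

$$z_{2R} = \tilde{y}_{2R} - \gamma_R x_{1R} + \gamma_I x_{1I}, \qquad z_{2I} = \tilde{y}_{2I} - \gamma_R x_{1I} - \gamma_I x_{1R}. \tag{23}$$

Assume the constellation points are ordered such that $p_i < p_k$ if $i < k$. Then $\hat{x}_{2R}$ maps to the point $p_i$ that satisfies

$$\beta\,\frac{p_{i-1} + p_i}{2} \le z_{2R} < \beta\,\frac{p_i + p_{i+1}}{2} \tag{24}$$

for $i = 1, \dots, P_2$, where $p_0 = -\infty$ and $p_{P_2+1} = +\infty$. Similarly for $\hat{x}_{2I}$. Hence the actual distances $f_2(x_2 \mid x_1)$
themselves need not be computed for all $x_2$ and a given $x_1$ in order to find the symbol $\hat{x}_2$ that minimizes $f_2(x_2 \mid x_1)$
in (7). Therefore, (6) requires only $|\mathcal{X}_1| = Q_1$ distance computations. By the same argument, (10) requires only
$|\mathcal{X}_2| = Q_2$ distance computations.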
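The reduction from $Q_1 \cdot Q_2$ to $Q_1$ distance evaluations can be illustrated with a short sketch. It enumerates $x_1$, slices $x_2$ with the boundary comparisons of (24), and compares the result against an exhaustive search. Zero priors are assumed, and the numerical values and function names are illustrative assumptions rather than part of the paper.

```python
def slice_pam(z, beta, pam):
    """Nearest point to z/beta using only the boundary comparisons of (24).
    `pam` must be sorted in increasing order."""
    for i in range(len(pam) - 1):
        if z < beta * (pam[i] + pam[i + 1]) / 2:  # upper boundary of p_i
            return pam[i]
    return pam[-1]

def detect_by_slicing(y1, y2, alpha, beta, gamma, pam):
    """Enumerate x1 and slice x2 per (21)-(22): Q1 distance evaluations."""
    best = None
    for re in pam:
        for im in pam:
            x1 = complex(re, im)
            z2 = y2 - gamma * x1  # (23)
            x2 = complex(slice_pam(z2.real, beta, pam),
                         slice_pam(z2.imag, beta, pam))
            d = abs(y1 - alpha * x1) ** 2 + abs(z2 - beta * x2) ** 2
            if best is None or d < best[0]:
                best = (d, x1, x2)
    return best

def detect_exhaustive(y1, y2, alpha, beta, gamma, pam):
    """Reference search over all Q1*Q2 candidate pairs."""
    qam = [complex(r, i) for r in pam for i in pam]
    return min(((abs(y1 - alpha * a) ** 2 + abs(y2 - gamma * a - beta * b) ** 2, a, b)
                for a in qam for b in qam), key=lambda t: t[0])

pam = (-3, -1, 1, 3)                             # 16-QAM on both layers
alpha, beta, gamma = 1.1, 0.8, complex(0.5, -0.3)  # example values (assumptions)
y1, y2 = complex(1.3, -2.9), complex(-0.7, 2.2)
sliced = detect_by_slicing(y1, y2, alpha, beta, gamma, pam)
exhaustive = detect_exhaustive(y1, y2, alpha, beta, gamma, pam)
```

For 16-QAM on both layers the sliced search evaluates 16 distances where the exhaustive one evaluates 256, and both return the same minimizer.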


C. Slicing Assuming Non-Zero Prior LLRs 

When the prior terms are included in the distance computations, slicing cannot be directly applied in (7) since
the decision boundaries now depend on the bias term $\mathbf{b}^T(x_2)\boldsymbol{\lambda}_2$. We develop next an optimal scheme that enables
efficient slicing similar to (24) based on [49]. In [50], a scheme that computes suboptimal slicing boundaries was
presented. Compared to our approach, [50] incurs a performance loss with equivalent complexity.

The real part of $\hat{x}_2$ in (7) is given by

$$\hat{x}_{2R} = \arg\min_{x_{2R}\in\mathcal{P}_2} \big\{ (z_{2R} - \beta x_{2R})^2 - \mathbf{b}^T(x_{2R})\boldsymbol{\lambda}_{2R} \big\}.$$

To decide in favor of $p_i \in \mathcal{P}_2$, then $\forall k \ne i$, we must have

$$(z_{2R} - \beta p_i)^2 - \mathbf{b}^T(p_i)\boldsymbol{\lambda}_{2R} < (z_{2R} - \beta p_k)^2 - \mathbf{b}^T(p_k)\boldsymbol{\lambda}_{2R}. \tag{25}$$

This condition can be formulated in terms of decision boundaries $R(p_i, p_k) = R(p_k, p_i)$,

$$R(p_i, p_k) = B\cdot(p_i + p_k) - \frac{(\mathbf{b}(p_i) - \mathbf{b}(p_k))^T}{p_i - p_k}\,\boldsymbol{\lambda}_{2R}, \quad \forall k \ne i, \tag{26}$$

between $p_i$ and all other $p_k$'s in $\mathcal{P}_2$. Assuming the points in $\mathcal{P}_2$ are ordered such that $p_i < p_k$ if $i < k$, then
for $p_1$ to satisfy (25), we must have $2\beta z_{2R} < R(p_1, p_k)$ for all $p_k > p_1$. For $p_{P_2}$ to satisfy (25), we must have
$2\beta z_{2R} > R(p_{P_2}, p_k)$ for all $p_k < p_{P_2}$. For any other internal point $p_i$, $i \ne 1$, $i \ne P_2$, we must have $2\beta z_{2R} < R(p_i, p_k)$
for all $p_k > p_i$, and $2\beta z_{2R} > R(p_i, p_k)$ for all $p_k < p_i$. These conditions can be combined into a single condition
for $i = 1, \dots, P_2$, as follows:

$$\max_{k=0,\dots,i-1} R(p_i, p_k) < 2\beta z_{2R} < \min_{k=i+1,\dots,P_2+1} R(p_i, p_k), \tag{27}$$

where $p_0 = -\infty$, $p_{P_2+1} = +\infty$, $\mathbf{b}(p_0) = \mathbf{b}(p_{P_2+1}) = \mathbf{0}_{1\times q_2/2}$. Note that (26) and (27) reduce to (24) when
$\boldsymbol{\lambda}_{2R} = \mathbf{0}_{1\times q_2/2}$.

Substituting (23) for $z_{2R}$ in (27) and using the constants (18)-(19) gives $2\beta z_{2R} = -G - (E x_{1R} + F x_{1I})$.
Accounting for the resulting sign change, we obtain the following slicing condition that is suitable for hardware
implementation:

$$\max_{k=i+1,\dots,P_2+1} \{-R(p_i, p_k)\} - G < E x_{1R} + F x_{1I} < \min_{k=0,\dots,i-1} \{-R(p_i, p_k)\} - G. \tag{28}$$

Note that in (28), the boundaries enter with a negative sign, so the maximum on the lefthand side is now taken over
all points $p_k \in \mathcal{P}_2$ that are greater than $p_i$, as opposed to smaller than $p_i$ as was done in (27). Similarly for the
minimum on the righthand side in (28).

A similar analysis applied to compute $\hat{x}_{2I} = \arg\min_{x_{2I}\in\mathcal{P}_2} \tilde{f}_{2I}(x_{2I} \mid x_1)$ leads to the decision boundaries $I(p_i, p_k)$:

$$I(p_i, p_k) = B\cdot(p_i + p_k) - \frac{(\mathbf{b}(p_i) - \mathbf{b}(p_k))^T}{p_i - p_k}\,\boldsymbol{\lambda}_{2I}, \tag{29}$$

using now $\boldsymbol{\lambda}_{2I}$, and the associated slicing condition, which follows from $2\beta z_{2I} = -H - (E x_{1I} - F x_{1R})$:

$$\max_{k=i+1,\dots,P_2+1} \{-I(p_i, p_k)\} - H < E x_{1I} - F x_{1R} < \min_{k=0,\dots,i-1} \{-I(p_i, p_k)\} - H. \tag{30}$$

Note that by construction of the decision boundaries in (27) (and their imaginary counterparts), the proposed
approach is optimal. The approach in [50] however is suboptimal because it employs heuristics to compute simplified
but suboptimal decision boundaries.
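A small numerical check of the soft decision boundaries may help make (25)-(27) concrete. The sketch below builds $R(p_i, p_k)$ from (26) for a 4-PAM constellation and verifies that the interval test of (27) picks the same point as a direct minimization of the biased distance. The Gray labeling in `BITS`, the parameter ranges and the function names are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical Gray labeling of 4-PAM with bits in {+1, -1} (an assumption).
PAM = [-3, -1, 1, 3]
BITS = {-3: (1, 1), -1: (1, -1), 1: (-1, -1), 3: (-1, 1)}

def boundary(pi, pk, B, lam):
    """Decision boundary R(p_i, p_k) of (26)."""
    diff = sum((bi - bk) * l for bi, bk, l in zip(BITS[pi], BITS[pk], lam))
    return B * (pi + pk) - diff / (pi - pk)

def slice_soft(z, beta, lam):
    """Return the point p_i whose interval in (27) contains 2*beta*z."""
    B, t = beta ** 2, 2 * beta * z
    for i, pi in enumerate(PAM):
        # Empty index sets correspond to the virtual points p_0 and p_{P+1}.
        lo = max((boundary(pi, pk, B, lam) for pk in PAM[:i]), default=-math.inf)
        hi = min((boundary(pi, pk, B, lam) for pk in PAM[i + 1:]), default=math.inf)
        if lo < t < hi:
            return pi
    return None  # only reachable on an exact tie

def slice_brute(z, beta, lam):
    """Direct minimization of (z - beta*p)^2 - b(p)^T lambda over all points."""
    return min(PAM, key=lambda p: (z - beta * p) ** 2
               - sum(b * l for b, l in zip(BITS[p], lam)))

random.seed(1)
mismatches = 0
for _ in range(500):
    z = random.uniform(-4.0, 4.0)
    lam = [random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0)]
    if slice_soft(z, 0.9, lam) != slice_brute(z, 0.9, lam):
        mismatches += 1
```

With strong priors some points can have an empty interval, in which case they are never selected; the interval test and the direct minimization still agree.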

IV. Extension to Higher-Order Layers 

The previous optimizations cannot be directly extended to $N \ge 3$ layers because the structure of the lower
triangular matrix $\mathbf{L}$ includes off-diagonal terms that prevent searching for the MAP solution by enumerating symbols
in one layer and finding the minima through slicing individually on all other layers in parallel. More specifically,
in Fig. 2(a), the presence of the demarked entries in the LTM implies that determining the MAP solution requires
enumerating symbols on the first $N-1$ layers and slicing only on the last layer, as is typically done in tree-search
detectors (e.g. [30]), and hence still requires $O(\prod_n Q_n)$ complexity rather than $O(\sum_n Q_n)$.

One desirable structure of H for a 4-layer MIMO system would be as shown in Fig. 2(b), in which the demarked 
entries are zeroed out. Here, by enumerating symbols on layer 1, the minimum distances and associated symbols 
on layers 2 to 4 can be searched for in parallel through slicing only on the corresponding layers, similar to the 
2-layer system. This suffices to compute the LLRs associated with the bits of the layer-1 symbol. A similar process is
repeated by decomposing H according to the structures shown in Figs. 2(c)-(e) [47] to compute the LLRs for the bits
associated with layers 2 to 4.

Fig. 2. 4x4 structures: (a) Full; (b)-(e) punctured structures for every layer.

Other "punctured" structures are also possible for a 4x4 system as shown in Fig. 3. They differ in 1) the number
of layers over which symbols are enumerated (enumeration or detection set), 2) the submatrix structure used to
propagate these enumerated symbols and cancel their interference effect from the remaining layers (interference
cancellation set), and 3) the number of layers in which the minimum distance and associated symbol can be obtained
by slicing after interference cancellation (slicer set). Let $U$ denote the size of the enumeration set, $S$ the size of
the slicer set, and $S \times U$ the size of the interference cancellation set. We refer to this structure using the triplet
$(U, S \times U, S)$. For example, in Fig. 3(a), we enumerate over $U = 1$ layer only, cancel interference from this layer
to the 3 other layers using a 3x1 structure, and slice over $S = 3$ layers. In the structure in Fig. 3(b), we enumerate
over $U = 2$ layers, cancel interference using a 2x2 structure, and slice over $S = 2$ layers.

LLR values are generated for bits in symbols included in the detection set only. Complementary structures that
enumerate symbols on other decoupled layers are required to generate their respective LLRs. For example, the
(1, 3x1, 3) structure requires 3 similar structures to generate LLRs for layers 2 to 4 (Fig. 2(c)-(e)). When $U > 1$,
decoupled layers can overlap by placing a stream with low reliability in multiple detection sets.



Fig. 3. (a) (1, 3x1, 3), (b) (2, 2x2, 2), (c) (3, 1x3, 1) punctured structures, each partitioned into an enumeration set, an interference cancellation set, and a slicer set.


A. WL Decomposition (WLD) Scheme 

In [51], a decomposition scheme was introduced to transform H into a punctured LTM L with a desired structure 
via a projection matrix W. In this section, we extend the scheme to handle soft-input MIMO detection using prior 
LLRs fed from a soft-input-soft-output channel decoder. We assume N = M. 

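We seek a matrix $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_N] \in \mathbb{C}^{N\times N}$ such that $\mathbf{W}^*\mathbf{H} = \mathbf{L}$ is a punctured LTM and $\mathbf{L} = [l_{ij}]$
with $l_{ii} \in \mathbb{R}^+$. In general, if $\mathbf{L}$ is punctured, then $\mathbf{W}$ is non-unitary and hence does not preserve Euclidean norm:

$$\tilde{\mathbf{y}} \triangleq \mathbf{W}^*\mathbf{y} = \mathbf{L}\mathbf{x} + \mathbf{W}^*\mathbf{n} \tag{31}$$

$$g(\mathbf{x}) \triangleq \|\tilde{\mathbf{y}} - \mathbf{L}\mathbf{x}\|^2 \ne \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2 = d(\mathbf{x}) \tag{32}$$

However, if we impose the condition

$$\mathrm{diag}(\mathbf{W}^*\mathbf{W}) = [1\ 1\ \cdots\ 1]^T_{N\times 1},$$

then the transformed noise vector $\mathbf{W}^*\mathbf{n}$ has an unaltered covariance matrix $E[\mathbf{W}^*\mathbf{n}\mathbf{n}^*\mathbf{W}] = \sigma^2\mathbf{I}_N$.

To induce a specific pattern of zeros below the main diagonal in $\mathbf{L}$, we choose the columns of $\mathbf{W}$ to be
orthogonal to the columns of $\mathbf{H} = [\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_N]$ where these zeros are to be introduced. More specifically, let
$\mathcal{I}_n$, $n = 1, \dots, N$, be the column index sets where puncturing is desired in each row $n$ of $\mathbf{L}$. Denote by $\mathbf{H}_{\mathcal{I}_n}$ the
submatrix formed by the columns of $\mathbf{H}$ whose index belongs to the set $\mathcal{I}_n$. Define the column vector $\tilde{\mathbf{w}}_n = \mathbf{P}_{\mathcal{I}_n}\mathbf{h}_n$,
where

$$\mathbf{P}_{\mathcal{I}_n} = \mathbf{I}_N - \mathbf{H}_{\mathcal{I}_n}\big(\mathbf{H}^*_{\mathcal{I}_n}\mathbf{H}_{\mathcal{I}_n}\big)^{-1}\mathbf{H}^*_{\mathcal{I}_n}, \tag{33}$$

and $\mathbf{H}_{\mathcal{I}_n} = \{\mathbf{h}_m \mid m \in \mathcal{I}_n\}$. Then the column vectors of $\mathbf{W}$ are given by the normalized vectors $\mathbf{w}_n = \tilde{\mathbf{w}}_n / \|\tilde{\mathbf{w}}_n\|$,
which satisfies the unit-diagonal condition above.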

Furthermore, it was shown in [51] that L and W*y can be derived using a simple modification to the standard 
QL decomposition procedure [52]. This avoids the need for expensive matrix inversion operations in (33). On 
modern vector digital signal processors (DSPs), matrix QLD operations are natively supported and optimized as part
of the instruction set. For example, on a CEVA XC-4210 processor [53], QL decomposition of a 4x4 complex
matrix requires only 12 clock cycles. Hence, we assume that the channel matrix H has been preprocessed by a 
similar DSP, and detection is performed based on the transformed system in (31). Note that because of (32), the 
solution to the detection problem is no longer optimal (but still achieves near-optimal performance as demonstrated 
in Section VII). 
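To make the projection idea behind (33) concrete, the following sketch constructs a matrix with unit-norm columns for a 3x3 channel and checks that the product with the channel has the punctured pattern of a (1, 2x1, 2) structure. It is not the modified QL procedure of [51]: the projection is computed here by Gram-Schmidt orthogonalization of the punctured columns, which is one standard way of applying the projector in (33) without an explicit inverse. The random channel, the index sets and all function names are illustrative assumptions.

```python
import random

def inner(u, v):
    """Hermitian inner product u^* v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def project_out(h, cols):
    """Component of h orthogonal to span(cols), i.e. P h with P as in (33),
    computed with Gram-Schmidt instead of a matrix inverse."""
    basis = []
    for c in cols:
        for q in basis:
            k = inner(q, c)
            c = [ci - k * qi for ci, qi in zip(c, q)]
        norm = abs(inner(c, c)) ** 0.5
        basis.append([ci / norm for ci in c])
    for q in basis:
        k = inner(q, h)
        h = [hi - k * qi for hi, qi in zip(h, q)]
    return h

def wl_decompose(H_cols, index_sets):
    """Return the columns of W and L = W^* H for the requested puncturing."""
    W = []
    for n, h in enumerate(H_cols):
        w = project_out(h, [H_cols[m] for m in index_sets[n]])
        norm = abs(inner(w, w)) ** 0.5
        W.append([wi / norm for wi in w])           # unit norm: diag(W^* W) = 1
    L = [[inner(w, h) for h in H_cols] for w in W]  # L[n][m] = w_n^* h_m
    return W, L

random.seed(3)
N = 3
H_cols = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
          for _ in range(N)]
# (1, 2x1, 2) structure with layer 1 enumerated, 0-based column indices:
# row 0 is zero in columns 1 and 2, row 1 in column 2, row 2 in column 1.
index_sets = [[1, 2], [2], [1]]
W, L = wl_decompose(H_cols, index_sets)
```

The diagonal of the resulting matrix is real and positive by construction, since each diagonal entry equals the norm of the projected column.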


B. Optimized Detection Algorithm Using WLD 

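We next present an optimized detection algorithm based on the WLD scheme, by extending the $N = 2$ case of
Section III. For simplicity, we only consider decompositions of the form $(1, S \times 1, S)$, similar to Fig. 2. The $N$
layers are decoupled by first circularly shifting the columns of $\mathbf{H}$, and then performing WLD on the permuted $\mathbf{H}$.
We refer to the decomposition whose detection set is the $m$th layer as the $m$th WLD of $\mathbf{H}$. To simplify notation,
we describe the detection steps for $m = 1$. The same steps apply to detect the other layers with an appropriate
adjustment to the layer indices. Let

$$\mathbf{x} = \begin{bmatrix} x_1\\ x_2\\ x_3\\ \vdots\\ x_N \end{bmatrix},\quad \tilde{\mathbf{y}} = \begin{bmatrix} \tilde{y}_1\\ \tilde{y}_2\\ \tilde{y}_3\\ \vdots\\ \tilde{y}_N \end{bmatrix},\quad \mathbf{L} = \begin{bmatrix} \alpha & 0 & 0 & \cdots & 0\\ \gamma_2 & \beta_2 & 0 & \cdots & 0\\ \gamma_3 & 0 & \beta_3 & \cdots & 0\\ \vdots & \vdots & & \ddots & \vdots\\ \gamma_N & 0 & \cdots & 0 & \beta_N \end{bmatrix} \tag{34}$$

be the transmitted symbol vector, received signal vector, and the WL-decomposed channel matrix in normal order,
respectively, where $\tilde{y}_n \in \mathbb{C}$, $x_n \in \mathcal{X}_n$ for $n = 1, \dots, N$; $\alpha, \beta_n \in \mathbb{R}^+$ and $\gamma_n \in \mathbb{C}$ for $n = 2, \dots, N$. Then the distance
metric $g(\mathbf{x})$ of $\mathbf{x}$ from $\tilde{\mathbf{y}}$ based on $\mathbf{L}$ in (32) can be written as

$$g(\mathbf{x}) = f_1(x_1) + \sum_{n=2}^{N} f_n(x_n \mid x_1), \tag{35}$$

where

$$f_1(x_1) = |\tilde{y}_1 - \alpha x_1|^2 - \mathbf{b}^T(x_1)\boldsymbol{\lambda}_1, \quad\text{and}$$
$$f_n(x_n \mid x_1) = |\tilde{y}_n - \gamma_n x_1 - \beta_n x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n$$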



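for $n = 2, \dots, N$. We next minimize $g(\mathbf{x})$ similar to (5):

$$g^{\mathrm{WL}} = \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \sum_{n=2}^{N} \min_{x_n\in\mathcal{X}_n} f_n(x_n \mid x_1) \Big\} \tag{36}$$
$$= \min_{x_1\in\mathcal{X}_1} \Big\{ f_1(x_1) + \sum_{n=2}^{N} f_n(\hat{x}_n(x_1) \mid x_1) \Big\} \tag{37}$$
$$= \min_{x_1\in\mathcal{X}_1} g(x_1, \hat{x}_2(x_1), \dots, \hat{x}_N(x_1)), \tag{38}$$

where

$$\hat{x}_n(x_1) = \arg\min_{x_n\in\mathcal{X}_n} \big\{ |\tilde{y}_n - \gamma_n x_1 - \beta_n x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n \big\}.$$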


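Denote the set of sliced symbol vectors for all possible $x_1$ in (38) by (defined similar to (8) but for any $N \ge 2$)

$$\mathcal{O}_1 = \big\{ [x_1\ \hat{x}_2(x_1)\ \cdots\ \hat{x}_N(x_1)]^T : x_1 \in \mathcal{X}_1 \big\}.$$

The symbol vector that minimizes (35) is denoted as

$$\mathbf{x}^{\mathrm{WL}} = \arg\min_{\mathbf{x}\in\mathcal{O}_1} g(\mathbf{x}). \tag{39}$$

To efficiently determine $\mathbf{x}^{\mathrm{WL}}$, we optimize the distance computations in (36) by splitting the complex quantities
into their real and imaginary components:

$$f_1(x_1) = f_{1R}(x_{1R}) + f_{1I}(x_{1I})$$
$$f_{1R}(x_{1R}) = (\tilde{y}_{1R} - \alpha x_{1R})^2 - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R}$$
$$f_{1I}(x_{1I}) = (\tilde{y}_{1I} - \alpha x_{1I})^2 - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I}$$

and

$$f_n(x_n \mid x_1) = f_{nR}(x_{nR} \mid x_1) + f_{nI}(x_{nI} \mid x_1)$$
$$f_{nR}(x_{nR} \mid x_1) = (\tilde{y}_{nR} - \gamma_{nR} x_{1R} + \gamma_{nI} x_{1I} - \beta_n x_{nR})^2 - \mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR}$$
$$f_{nI}(x_{nI} \mid x_1) = (\tilde{y}_{nI} - \gamma_{nR} x_{1I} - \gamma_{nI} x_{1R} - \beta_n x_{nI})^2 - \mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI}$$

for $n \ge 2$. Substituting back in (37), expanding terms, minimizing w.r.t. $x_{nR}$ and $x_{nI}$, and eliminating irrelevant
terms, we obtain

$$\tilde{g}(\mathbf{x}) = \tilde{f}_{1R}(x_{1R}) + \tilde{f}_{1I}(x_{1I}) + \sum_{n=2}^{N} \Big\{ \min_{x_{nR}\in\mathcal{P}_n} \tilde{f}_{nR}(x_{nR} \mid x_1) + \min_{x_{nI}\in\mathcal{P}_n} \tilde{f}_{nI}(x_{nI} \mid x_1) \Big\} \tag{40}$$

where

$$\tilde{f}_{1R}(x_{1R}) = A x_{1R}^2 + C x_{1R} - \mathbf{b}^T_{1R}\boldsymbol{\lambda}_{1R}$$
$$\tilde{f}_{1I}(x_{1I}) = A x_{1I}^2 + D x_{1I} - \mathbf{b}^T_{1I}\boldsymbol{\lambda}_{1I}$$
$$\tilde{f}_{nR}(x_{nR} \mid x_1) = (E_n x_{1R} + F_n x_{1I})\, x_{nR} + (B_n x_{nR}^2 + G_n x_{nR} - \mathbf{b}^T_{nR}\boldsymbol{\lambda}_{nR})$$
$$\tilde{f}_{nI}(x_{nI} \mid x_1) = (E_n x_{1I} - F_n x_{1R})\, x_{nI} + (B_n x_{nI}^2 + H_n x_{nI} - \mathbf{b}^T_{nI}\boldsymbol{\lambda}_{nI}).$$

Similar to (18)-(19), the constants above are given by:

$$A = \alpha^2 + \sum_{n=2}^{N} |\gamma_n|^2, \quad B_n = \beta_n^2$$
$$C = -2\alpha\tilde{y}_{1R} - 2\sum_{n=2}^{N} (\gamma_{nR}\tilde{y}_{nR} + \gamma_{nI}\tilde{y}_{nI})$$
$$D = -2\alpha\tilde{y}_{1I} + 2\sum_{n=2}^{N} (\gamma_{nI}\tilde{y}_{nR} - \gamma_{nR}\tilde{y}_{nI})$$
$$E_n = +2\beta_n\gamma_{nR}, \quad F_n = -2\beta_n\gamma_{nI}, \quad G_n = -2\beta_n\tilde{y}_{nR}, \quad H_n = -2\beta_n\tilde{y}_{nI}.$$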


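Using $\tilde{g}(\mathbf{x})$ (or $g(\mathbf{x})$), the LLRs of the bits in layer 1 are

$$\Lambda^{\mathrm{WL}}_{1,j} = \min_{x_1\in\mathcal{X}_1:\, b_{1,j}=+1} g(x_1, \hat{x}_2(x_1), \dots, \hat{x}_N(x_1)) \;-\; \min_{x_1\in\mathcal{X}_1:\, b_{1,j}=-1} g(x_1, \hat{x}_2(x_1), \dots, \hat{x}_N(x_1)). \tag{41}$$

The bit-LLRs in the remaining $N-1$ layers are similarly obtained by using the other $N-1$ complementary WL
structures of $\mathbf{H}$ (see Fig. 2). Finally, equations (26)-(28) for $N = 2$ can be used to slice $\hat{x}_{nR} = \arg\min_{x_{nR}\in\mathcal{P}_n} \tilde{f}_{nR}(x_{nR} \mid x_1)$,
and (29)-(30) to slice $\hat{x}_{nI} = \arg\min_{x_{nI}\in\mathcal{P}_n} \tilde{f}_{nI}(x_{nI} \mid x_1)$, but with the constants $B, E, F, G, H$ replaced by $B_n, E_n, F_n, G_n, H_n$,
and $\mathcal{P}_2, \boldsymbol{\lambda}_{2R}, \boldsymbol{\lambda}_{2I}$ by $\mathcal{P}_n, \boldsymbol{\lambda}_{nR}, \boldsymbol{\lambda}_{nI}$.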
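The structure of (36)-(38) can be sketched for a 3-layer system with the punctured matrix of (34): each candidate $x_1$ is enumerated once, and layers 2 to $N$ are sliced independently of one another. The sketch assumes zero priors and 16-QAM on every layer, uses a division by $\beta_n$ for readability (the hardware-oriented form in the text avoids it), and compares the result with an exhaustive search over the same metric. All numerical values and names are assumptions.

```python
from itertools import product

PAM = (-3, -1, 1, 3)
QAM = [complex(r, i) for r in PAM for i in PAM]

def nearest(v):
    """Nearest 4-PAM point (zero-prior slicer)."""
    return min(PAM, key=lambda p: abs(v - p))

def g_metric(x, y, alpha, gammas, betas):
    """Distance (35) for the punctured L of (34), zero priors."""
    d = abs(y[0] - alpha * x[0]) ** 2
    for n in range(1, len(x)):
        d += abs(y[n] - gammas[n - 1] * x[0] - betas[n - 1] * x[n]) ** 2
    return d

def detect_wl(y, alpha, gammas, betas):
    """Enumerate x1 and slice layers 2..N independently, as in (36)-(38)."""
    best = None
    for x1 in QAM:
        x = [x1]
        for n in range(1, len(y)):
            z = (y[n] - gammas[n - 1] * x1) / betas[n - 1]
            x.append(complex(nearest(z.real), nearest(z.imag)))
        d = g_metric(x, y, alpha, gammas, betas)
        if best is None or d < best[0]:
            best = (d, tuple(x))
    return best

# Example values (assumptions) for a 3-layer punctured system.
alpha = 1.1
gammas = [complex(0.3, 0.2), complex(-0.4, 0.1)]   # gamma_2, gamma_3
betas = [0.9, 1.2]                                 # beta_2, beta_3
y = [complex(0.9, -3.2), complex(1.4, 0.3), complex(-2.1, 2.6)]

d_wl, x_wl = detect_wl(y, alpha, gammas, betas)
d_ref, x_ref = min(((g_metric(x, y, alpha, gammas, betas), x)
                    for x in product(QAM, repeat=3)), key=lambda t: t[0])
```

Here the sliced search evaluates 16 candidate vectors while the exhaustive reference evaluates 4096, yet both reach the same minimum of the punctured metric.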


C. Post LLR Processing 

Since $g(\mathbf{x}) \ne d(\mathbf{x})$, there is no guarantee that the $g^{\mathrm{WL}}$ and $\mathbf{x}^{\mathrm{WL}}$ obtained in (38) and (39) using one WLD
structure of $\mathbf{H}$ are the same ones obtained using the other $N-1$ WLD structures with the columns of $\mathbf{H}$ permuted.
To avoid confusion, we refer to the quantities in (35), (38), and (39) pertaining to the $m$th layer WL decomposition
using the subscript $m$: $g_m(\mathbf{x})$, $g^{\mathrm{WL}}_m$, $\mathbf{x}^{\mathrm{WL}}_m$.

The "WL-minimal" HD solution, denoted as $g^{\mathrm{WL}}_{\min}$ and $\mathbf{x}^{\mathrm{WL}}_{\min}$, corresponds to the minimum of the $N$ various $g^{\mathrm{WL}}_m$
values:

$$g^{\mathrm{WL}}_{\min} = \min_m g^{\mathrm{WL}}_m, \qquad \mathbf{x}^{\mathrm{WL}}_{\min} = \arg\min_{\mathbf{x}^{\mathrm{WL}}_m} g^{\mathrm{WL}}_m.$$

A similar minimization is required as well to adjust the LLR values relative to the global minimum $g^{\mathrm{WL}}_{\min}$
and the bits of its corresponding symbol vector $\mathbf{x}^{\mathrm{WL}}_{\min}$. This adjustment cannot be done by comparing the individual
$\Lambda^{\mathrm{WL}}_{n,j}$'s alone. One simple way is based on the list of distances $g_m(\mathbf{x})$ generated from all decompositions for
$m = 1, \dots, N$, together with their corresponding symbol vectors. Let $\mathcal{O}_m$ denote the set of symbol vectors

$$\mathcal{O}_m = \big\{ [\hat{x}_1(x_m)\ \cdots\ \hat{x}_{m-1}(x_m)\ x_m\ \hat{x}_{m+1}(x_m)\ \cdots\ \hat{x}_N(x_m)]^T : x_m \in \mathcal{X}_m \big\}, \quad m = 1, \dots, N, \tag{42}$$
Fig. 4. Block diagram of a parallel one-sided 2x2 MAP detector core, with input and output interfaces


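where the $n$th sliced symbol in the $m$th WLD is

$$\hat{x}_n(x_m) = \arg\min_{x_n\in\mathcal{X}_n} \big\{ |\tilde{y}_{n,m} - \gamma_{n,m} x_m - \beta_{n,m} x_n|^2 - \mathbf{b}^T(x_n)\boldsymbol{\lambda}_n \big\},$$

for $n \ne m$. Here $\tilde{y}_{n,m}$, $\gamma_{n,m}$, and $\beta_{n,m}$ are defined as in (34) but relative to the $m$th WLD of $\mathbf{H}$. Next, define the
partitions on $\mathcal{O}_m$: $\mathcal{O}^{(+1)}_{m,n,j} = \{\mathbf{x} \in \mathcal{O}_m : b_{n,j} = +1\}$ and $\mathcal{O}^{(-1)}_{m,n,j} = \{\mathbf{x} \in \mathcal{O}_m : b_{n,j} = -1\}$. Then the "WL-minimal"
LLRs are given by

$$\Lambda^{\mathrm{WL}}_{n,j} = \min_m \Big\{ \min_{\mathbf{x}\in\mathcal{O}^{(+1)}_{m,n,j}} g_m(\mathbf{x}) \Big\} - \min_m \Big\{ \min_{\mathbf{x}\in\mathcal{O}^{(-1)}_{m,n,j}} g_m(\mathbf{x}) \Big\}. \tag{43}$$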
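The post-LLR step in (43) amounts to a pair of minimum searches over the merged candidate lists of all decompositions. A minimal sketch follows, assuming each decomposition delivers a list of (distance, bit vector) pairs; this data layout, the toy values and the function name `wl_minimal_llrs` are assumptions made for illustration.

```python
import math

def wl_minimal_llrs(lists, num_bits):
    """Combine candidate lists from all WL decompositions as in (43).

    lists[m] holds (g_m(x), bits(x)) pairs from the m-th decomposition,
    where bits(x) is a tuple over {+1, -1} covering all layers.
    """
    llrs = []
    for k in range(num_bits):
        d_plus = min((d for lst in lists for d, b in lst if b[k] == +1),
                     default=math.inf)
        d_minus = min((d for lst in lists for d, b in lst if b[k] == -1),
                      default=math.inf)
        llrs.append(d_plus - d_minus)  # same sign convention as (3)
    return llrs

# Toy input (assumption): two decompositions, two bits in total.
lists = [
    [(0.5, (+1, -1)), (2.0, (-1, -1))],   # candidates from the 1st WLD
    [(0.7, (+1, +1)), (1.5, (-1, +1))],   # candidates from the 2nd WLD
]
llrs = wl_minimal_llrs(lists, num_bits=2)
```

In this toy input the second bit takes the value +1 only in candidates from the second decomposition, which shows why the per-decomposition LLRs cannot simply be compared on their own.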


D. Discussion 

The key equations for the general $N$-layer case derived above reduce to the optimal equations derived in Section III
for $N = 2$. A comparison between the two shows that the same operations applied to compute $\tilde{d}(\mathbf{x})$ in (13) are
applied to compute $\tilde{g}(\mathbf{x})$ in (40), but using the respective constants of layer $n$ instead of layer 2. Hence, a 2x2
MAP detector can be viewed as a building block for constructing detectors for higher-order layers, with a simple
modification to account for the extra accumulated sum terms in (40), in addition to the LLR processing of (43) at
the output stage. A parallel architecture will be developed next and its complexity analyzed.


V. Parallel 2-Layer Detector Architecture 

Figure 4 shows a block diagram of a parallel 2x2 MAP detector core that implements the key equations in (13)-
(17). For flexibility and scalability to higher-order layers, the constellations supported on each layer are configurable
from BPSK up to 256-QAM, and can be distinct on each layer. We assume the input constants (18)-(19) to the
detector are supplied by an external DSP. The outputs are two lists of distances $\mathcal{D}_1$, $\mathcal{D}_2$ and their associated lists
of symbol vectors $\mathcal{O}_1$, $\mathcal{O}_2$, which are fed to a post LLR processing stage to extract the LLR values depending on
the number of layers.

A. Optimized Implementation of Distance Expressions 

A careful inspection of expressions (14)-(17) shows that $d(\mathbf{x})$ can be evaluated without using multipliers, assuming
the constants are pre-processed and fed as inputs to the detector. The reason is that the variables $x_{1R}$, $x_{1I}$, $x_{2R}$, and
$x_{2I}$ are integers that belong to a PAM constellation. More specifically, in LTE [2], they are odd integers in the set
$\mathcal{P}_2 = \{2m+1 \mid m = -P/2, \dots, 0, \dots, P/2-1\}$ with $P = \sqrt{Q_2}$. Hence the terms that involve the products of $x_{1R}$,
$x_{1I}$, $x_{2R}$, $x_{2I}$ in (14)-(17) with the constants in (18)-(19) are simply integer multiples of these constants. These
product terms can be computed using basic addition operations with appropriate power-of-2 manipulations of the
operands, without using expensive multipliers. Also from symmetry, only positive multiples need to be computed.
Table I summarizes the number of distinct product terms that need to be computed for various PAM
constellation sizes.

TABLE I

Distinct product terms to be computed: $x, y, z \in \mathcal{P}_2$; $r, s \in \mathbb{R}$.

| Distinct terms | 2-PAM | 4-PAM | 8-PAM | 16-PAM |
|---|---|---|---|---|
| $r \cdot \lvert x \rvert$ | 1 | 2 | 4 | 8 |
| $r \cdot \lvert x \rvert \cdot \lvert y \rvert$ | 1 | 3 | 10 | 33 |
| $r \cdot x^2$ | 1 | 2 | 4 | 8 |
| $(r \cdot \lvert x \rvert \pm s \cdot \lvert y \rvert) \cdot \lvert z \rvert$ | 2 | 14 | 116 | 914 |
| $\lvert \mathbf{b}_{1R}^T \boldsymbol{\lambda}_{1R} \rvert$ | 1 | 2 | 4 | 8 |
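The first four rows of Table I can be reproduced by direct enumeration. The sketch below is only an illustration: the function names are hypothetical, and it assumes that $r$ and $s$ are generic real constants, so that two forms $a r \pm b s$ coincide only when their integer coefficients and sign coincide.

```python
def odd_magnitudes(pam_size):
    """Positive odd amplitudes 1, 3, ..., pam_size - 1 of a PAM constellation."""
    return list(range(1, pam_size, 2))


def distinct_term_counts(pam_size):
    """Count distinct positive multiples for the first four rows of Table I."""
    mags = odd_magnitudes(pam_size)
    single = set(mags)                                   # r*|x|
    pair = {x * y for x in mags for y in mags}           # r*|x|*|y|
    square = {x * x for x in mags}                       # r*x^2
    # (r*|x| +/- s*|y|)*|z| = (|x||z|)*r +/- (|y||z|)*s; for generic r and s
    # the form is identified by its coefficient pair and by the sign.
    cross = {(x * z, y * z) for x in mags for y in mags for z in mags}
    return len(single), len(pair), len(square), 2 * len(cross)


for size in (2, 4, 8, 16):
    print(size, distinct_term_counts(size))
```

For 16-PAM this enumeration gives 8, 33, 8, and 914 distinct terms, matching the last column of the table.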

Moreover, the dot products $\mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R}$ between the input LLR vectors and all the bit vectors are simply all linear
binary combinations of the $q_1/2 = (\log_2 Q_1)/2$ individual input LLRs $\lambda_i$ of $\boldsymbol{\lambda}_{1R}$:

$$\pm\lambda_1 \pm \lambda_2 \pm \cdots \pm \lambda_{q_1/2}.$$

Also from symmetry, only half of these sums actually need to be computed, giving a total of $2^{q_1/2-1}$ different
sums. The same applies to other dot product terms in (15)-(17).

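Next, as $x_{1R}$ runs over the $P_1$ integers in $\mathcal{P}_1$, the expression $(A x_{1R}^2 + C x_{1R} - \mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R})$ takes $P_1$ different
values. However, because of the Gray mapping of the bits, $\mathbf{b}^T(-x_{1R})\boldsymbol{\lambda}_{1R} \neq -\mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R}$, and hence there
is no symmetry that can be exploited to save computations here. The same argument applies to the three other
expressions $(A x_{1I}^2 + D x_{1I} - \mathbf{b}_{1I}^T\boldsymbol{\lambda}_{1I})$, $(B x_{2R}^2 + G x_{2R} - \mathbf{b}_{2R}^T\boldsymbol{\lambda}_{2R})$, and $(B x_{2I}^2 + H x_{2I} - \mathbf{b}_{2I}^T\boldsymbol{\lambda}_{2I})$ in (15)-(17).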

Finally, for the remaining sum of products of cross terms $(E x_{1R} + F x_{1I})x_{2R}$, as $x_{2R}$ cycles through the $P_2$ integers
in $\mathcal{P}_2$, the expression takes $P_2$ different values for every pair $(x_{1R}, x_{1I})$. However, for all possible $(x_{1R}, x_{1I})$,
repetitions occur. The number of unique values of $(E x_{1R} + F x_{1I})x_{2R}$ is twice that of $(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$
(summarized in Table I). By symmetry, these are also the same values taken by the other sub-expression
$(E x_{1I} - F x_{1R})x_{2I}$ in (17).

Therefore, hardware complexity will be measured in terms of the number of adders, in addition to the number of (2:1)-
multiplexers (muxes) needed to steer operands to these adders. Wider $(n{:}1)$-muxes can be constructed using $n-1$
(2:1)-muxes.

We next determine the actual number of adders required to compute each of the unique terms in (14)-(17),
assuming 256-QAM and its underlying 1-D 16-PAM constellation. The same analysis applies to other constellations.
The required multiples for 16-PAM are $\{9, 25, 49, 81, 121, 169, 225\} \times A$, which can be generated using 11
adders as follows:

$9A = 8A + A$, $25A = 16A + 9A$, $15A = 16A - A$,

$49A = 64A - 15A$, $81A = 32A + 49A$, $7A = 8A - A$,

$121A = 128A - 7A$, $41A = 32A + 9A$, $169A = 128A + 41A$,

$31A = 32A - A$, $225A = 256A - 31A$.
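The 11-adder schedule above can be checked mechanically. The following sketch is an illustration rather than the authors' implementation: the `CHAIN` table and `run_chain` are hypothetical names, and the constant $A$ is modeled as an integer so that a power-of-2 multiple is a plain left shift.

```python
# Each entry: (target multiple, shift, sign, previously available multiple),
# meaning target*A = (A << shift) + sign * (operand*A). Shifts cost no adders.
CHAIN = [
    (9, 3, +1, 1),     # 9A   = 8A   + A
    (25, 4, +1, 9),    # 25A  = 16A  + 9A
    (15, 4, -1, 1),    # 15A  = 16A  - A
    (49, 6, -1, 15),   # 49A  = 64A  - 15A
    (81, 5, +1, 49),   # 81A  = 32A  + 49A
    (7, 3, -1, 1),     # 7A   = 8A   - A
    (121, 7, -1, 7),   # 121A = 128A - 7A
    (41, 5, +1, 9),    # 41A  = 32A  + 9A
    (169, 7, +1, 41),  # 169A = 128A + 41A
    (31, 5, -1, 1),    # 31A  = 32A  - A
    (225, 8, -1, 31),  # 225A = 256A - 31A
]


def run_chain(a):
    """Evaluate the chain for integer a; return the multiples and the adder count."""
    available = {1: a}
    adders = 0
    for target, shift, sign, operand in CHAIN:
        value = (a << shift) + sign * available[operand]  # operand must already exist
        assert value == target * a
        available[target] = value
        adders += 1
    return available, adders


multiples, adders = run_chain(37)
print(adders, [multiples[m * m] // 37 for m in range(3, 16, 2)])
```

Every squared amplitude $m^2$, $m = 3, 5, \dots, 15$, is reached, with the helper multiples 15, 7, 41, and 31 accounting for the remaining four adders.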

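Similarly, the 8 multiples $C|x_{1R}|$ can be generated using 7 adders. Of the 16 values of $\mathbf{b}^T(x_{1R})\boldsymbol{\lambda}_{1R}$, 8 can be generated as

$(\lambda_1+\lambda_2) \pm (\lambda_3+\lambda_4)$, $(\lambda_1+\lambda_2) \pm (\lambda_3-\lambda_4)$,

$(\lambda_1-\lambda_2) \pm (\lambda_3+\lambda_4)$, $(\lambda_1-\lambda_2) \pm (\lambda_3-\lambda_4)$

with 12 adders. The other 8 are their negatives.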

To generate the unique elements of $(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$, we first generate all unique sums with $x_{2R} = 1$, i.e.,
$(E|x_{1R}| \pm F|x_{1I}|)$, such that $\gcd(|x_{1R}|, |x_{1I}|) = 1$, and then generate all their multiples. The number of unique sums
of the form $(E|x_{1R}| \pm F|x_{1I}|)$ with co-prime coefficients $|x_{1R}|$ and $|x_{1I}|$ from the set $\{1, 3, \dots, 15\}$ is 49. We next
enumerate the unique multiples from each of these 49 classes. For $(|x_{1R}|, |x_{1I}|) = (1, 1)$, there are $33 \times 2$ distinct
multiples of $(E \pm F)$. For $(|x_{1R}|, |x_{1I}|) = (1, 3)$ or $(3, 1)$, there are $18 \times 2$ distinct multiples of $(E \pm 3F)$ and $18 \times 2$
of $(3E \pm F)$. For $(|x_{1R}|, |x_{1I}|) = (1, 5)$, $(5, 1)$, $(3, 5)$, or $(5, 3)$, there are $13 \times 2$ distinct multiples of each. Finally,
for the remaining 42 classes, there are $8 \times 2$ distinct multiples of each. Summing all distinct multiples we obtain
914.

Table II summarizes the various constants that appear in the computation of $(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$ for 16-PAM,
and how they are generated using addition operations involving power-of-2 operands and other already computed
constants. First, the odd multiples $3E, 5E, \dots, 15E$, and $3F, 5F, \dots, 15F$, require 14 adders. The term $(E \pm F)$
and its $33 \times 2$ distinct multiples require all the 36 constants in Table II and hence need $(36+1) \times 2$ adders. The term
$(E \pm 3F)$ requires 18 constants $\{1, 3, 5, 7, 9, 11, 13, 15, 21, 25, 27, 33, 35, 39, 45, 55, 65, 75\}$, and hence needs $2 \times 18$
adders. The same count is needed for $(3E \pm F)$. On the other hand, the term $(E \pm 5F)$ and its $13 \times 2$ distinct multiples
require only 13 constants $\{1, 3, 5, 7, 9, 11, 13, 15, 21, 27, 33, 39, 45\}$ and hence need $2 \times 13$ adders. The same applies
for $(5E \pm F)$, $(3E \pm 5F)$, and $(5E \pm 3F)$. For the remaining 42 classes, only 8 constants $\{1, 3, 5, 7, 9, 11, 13, 15\}$
appear and hence need $2 \times 8 \times 42$ adders. Summing all counts results in a total of 936 adders. Finally, note that
$(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$ takes the same values as $(E|x_{1I}| \pm F|x_{1R}|)|x_{2I}|$ but in a different order.
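The class structure behind the 914 distinct multiples and the 936-adder total can be enumerated directly. This is a sketch under the counting rules stated in the text (function names are hypothetical): pairs $(|x_{1R}|, |x_{1I}|)$ are reduced by their gcd to a co-prime class, and the scaled values of $|x_{2R}|$ are collected per class.

```python
from math import gcd

MAGS = range(1, 16, 2)  # amplitudes |x| of 16-PAM: 1, 3, ..., 15


def multiples_per_class():
    """Map each co-prime pair (p, q) to the set of integers k for which
    (k*p, k*q) = (|x1R|*|x2R|, |x1I|*|x2R|) for some amplitudes."""
    classes = {}
    for x in MAGS:
        for y in MAGS:
            g = gcd(x, y)
            for z in MAGS:
                classes.setdefault((x // g, y // g), set()).add(g * z)
    return classes


def adder_tally(classes):
    """Adder count following the accounting used in the text."""
    total = 14                                 # 3E..15E and 3F..15F
    for pair, ks in classes.items():
        if pair == (1, 1):
            total += 2 * (36 + 1)              # 36 constants of Table II, plus E+F and E-F
        else:
            total += 2 * len(ks)               # 2x18, 2x13 or 2x8 per class
    return total


classes = multiples_per_class()
print(len(classes), 2 * sum(len(ks) for ks in classes.values()), adder_tally(classes))
```

The enumeration yields 49 classes, 914 distinct multiples, and 936 adders, in agreement with the counts quoted above.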

B. Minimization by Exhaustive Search 

One approach to implement the minimizations in (13) is by exhaustive search. In (16), for every pair $(x_{1R}, x_{1I})$,
16 out of the 914 distinct values of $(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$ pertaining to the 16 different $x_{2R}$'s are added to
$B x_{2R}^2 + G x_{2R} - \mathbf{b}_{2R}^T\boldsymbol{\lambda}_{2R}$, and the minimum is selected. Hence a total of $16 \times 256$ adders are needed to generate all possible
values of $f_{2R}(x_{2R} \mid x_1)$. The same holds for $f_{2I}(x_{2I} \mid x_1)$. To find the minimum among $P_2$ quantities, a binary
tree of comparators comprising $P_2 - 1$ adders and $P_2 - 1$ (2:1)-multiplexers is needed. A total of $2 \times 256$ such
comparators are needed. Finally, the 256 minima from each case are added to complete the sum for $d(\mathbf{x})$ in (13).
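As a software analogue of the comparator tree described above, the sketch below reduces a list pairwise and counts the 2-input comparisons, each of which stands for one subtractor (adder) and one (2:1)-mux. The function name is hypothetical and the code only illustrates the counting argument.

```python
def tree_min(values):
    """Return (minimum, number of 2-input comparisons) of a binary comparator tree."""
    level = list(values)
    comparisons = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(a if a <= b else b)  # compare (adder) then select (2:1 mux)
            comparisons += 1
        if len(level) % 2:                  # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0], comparisons


candidates = [(7 * i * i - 50 * i + 3) % 101 for i in range(16)]
minimum, comparisons = tree_min(candidates)
print(minimum, comparisons)
```

For $P_2 = 16$ candidates the tree uses 15 comparisons, so 256 trees per component account for the $256 \times 15 = 3840$ adders and muxes listed in Table III.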

TABLE II

Constants that appear in $(E|x_{1R}| \pm F|x_{1I}|)|x_{2R}|$ for 16-PAM

3 = 2+1         5 = 4+1          7 = 8-1          9 = 8+1
11 = 8+3        13 = 16-3        15 = 16-1        21 = 16+5
25 = 16+9       27 = 32-5        33 = 32+1        35 = 32+3
39 = 32+7       45 = 32+13       49 = 64-15       55 = 64-9
63 = 64-1       65 = 64+1        75 = 64+11       77 = 64+13
81 = 32+49      91 = 64+27       99 = 64+35       41 = 32+9
105 = 64+41     117 = 128-11     121 = 128-7      135 = 128+7
143 = 128+15    37 = 33+4        165 = 128+37     169 = 128+41
61 = 64-3       195 = 256-61     31 = 32-1        225 = 256-31

To generate the hard-decision MAP solution, the minimum among the 256 distances $d(\mathbf{x})$ must be taken and
the corresponding constellation symbol identified. This requires a total of 255 adders and 255 muxes. On the
other hand, to compute the output LLRs of the bits in $x_1$ according to (9), the 256 distances in (13) must be
minimized over two complementary sets for every bit and their difference taken. The 256-QAM constellation
points can be viewed as 16 columns each containing 16 points, or as 16 rows each containing 16 points. In LTE,
the 4 bits corresponding to the real part of the constellation points do not change within a column, and the 4 bits
corresponding to the imaginary part do not change within a row. Hence it suffices to take the minimum distances
among all points in each row and among all points in each column independently. The column minima are used
to compute the LLRs of the real bits by partitioning the columns into two groups of 8 columns depending on
whether the bit is +1 or -1 in the column. The minimum distance among each group of columns is taken, and
the difference of the two minima generates the LLR of that bit. The same applies to the imaginary bits and the
row minima. Hence a total of $2 \times 16$ 16-point comparators are needed, amounting to 480 adders and 480 muxes, to
extract the minima, followed by 8 adders to take the differences.
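The row/column shortcut described above can be compared against a brute-force search over all 256 points. The sketch below is illustrative only: `gray_bits` is a hypothetical Gray labelling (not necessarily the LTE mapping), and the sign convention (minimum over the $-1$ set minus minimum over the $+1$ set) is an assumption, since (9) is not reproduced here. The only property used is the one stated in the text, namely that real bits are constant within a column and imaginary bits within a row.

```python
def gray_bits(index, nbits=4):
    """Hypothetical Gray labelling of a PAM index as +1/-1 bits, MSB first."""
    g = index ^ (index >> 1)
    return [+1 if ((g >> k) & 1) == 0 else -1 for k in reversed(range(nbits))]


def _bit_llrs(minima, nbits=4):
    """LLRs of the bits tied to one axis, from per-column (or per-row) minima."""
    llrs = []
    for k in range(nbits):
        plus = min(m for i, m in enumerate(minima) if gray_bits(i, nbits)[k] == +1)
        minus = min(m for i, m in enumerate(minima) if gray_bits(i, nbits)[k] == -1)
        llrs.append(minus - plus)           # assumed sign convention
    return llrs


def llrs_row_column(dist):
    """dist[r][c]: distance of the point in row r (imaginary) and column c (real)."""
    n = len(dist)
    col_min = [min(dist[r][c] for r in range(n)) for c in range(n)]
    row_min = [min(row) for row in dist]
    return _bit_llrs(col_min) + _bit_llrs(row_min)


def llrs_brute_force(dist):
    """Reference: minimise over all 256 points for every bit."""
    n = len(dist)
    llrs = []
    for axis in ("col", "row"):
        for k in range(4):
            plus, minus = [], []
            for r in range(n):
                for c in range(n):
                    idx = c if axis == "col" else r
                    (plus if gray_bits(idx)[k] == +1 else minus).append(dist[r][c])
            llrs.append(min(minus) - min(plus))
    return llrs


grid = [[((37 * r + 11 * c * c + 5 * r * c) % 97) / 4 for c in range(16)] for r in range(16)]
print(llrs_row_column(grid) == llrs_brute_force(grid))
```

Both routes give identical LLRs for the 8 bits of the symbol, while the shortcut needs only the 32 row and column minima.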

Table III summarizes the core complexity using exhaustive search. The core requires 18290 adders and 8160 
muxes. 

C. Minimization by Slicing 

We next analyze the complexity of computing $\min_{x_{2R} \in \mathcal{P}_2} f_{2R}(x_{2R} \mid x_1)$ in (13) via the slicing approach by first
determining $\hat{x}_{2R} = \arg\min_{x_{2R} \in \mathcal{P}_2} f_{2R}(x_{2R} \mid x_1)$ followed by evaluating $f_{2R}(\hat{x}_{2R} \mid x_1)$, for all possible $x_1$. To minimize
$f_{2R}(x_{2R} \mid x_1)$, the decision boundaries $R(x_{2R}, \bar{x}_{2R})$ in (26) must be computed for all $x_{2R} \neq \bar{x}_{2R} \in \mathcal{P}_2$, and appropriate
minima and maxima must be extracted from these boundaries according to (28) and compared to $E x_{1R} + F x_{1I}$.
Similarly, to minimize $f_{2I}(x_{2I} \mid x_1)$, the decision boundaries $I(x_{2I}, \bar{x}_{2I})$ in (29) must be computed for all
$x_{2I} \neq \bar{x}_{2I} \in \mathcal{P}_2$, and appropriate minima and maxima must be extracted from these boundaries according to (30) and compared
to $E x_{1I} - F x_{1R}$.

TABLE III

Resources of detector core using exhaustive search (number of adders; mux counts are listed in the rows marked "muxes")

| Term | 2-PAM | 4-PAM | 8-PAM | 16-PAM |
|---|---|---|---|---|
| $A x_{1R}^2$ | 0 | 1 | 4 | 11 |
| $C \lvert x_{1R} \rvert$ | 0 | 1 | 3 | 7 |
| $\mathbf{b}_{1R}^T \boldsymbol{\lambda}_{1R}$ | 0 | 2 | 6 | 12 |
| $f_{1R}(x_{1R})$ | 4 | 8 | 16 | 32 |
| $D \lvert x_{1I} \rvert$ | 0 | 1 | 3 | 7 |
| $\mathbf{b}_{1I}^T \boldsymbol{\lambda}_{1I}$ | 0 | 2 | 6 | 12 |
| $f_{1I}(x_{1I})$ | 4 | 8 | 16 | 32 |
| $f_1(x_1) = f_{1R}(x_{1R}) + f_{1I}(x_{1I})$ | 4 | 16 | 64 | 256 |
| $B x_{2R}^2$ | 0 | 1 | 4 | 11 |
| $G \lvert x_{2R} \rvert$ | 0 | 1 | 3 | 7 |
| $\mathbf{b}_{2R}^T \boldsymbol{\lambda}_{2R}$ | 0 | 2 | 6 | 12 |
| $B x_{2R}^2 + G x_{2R} - \mathbf{b}_{2R}^T \boldsymbol{\lambda}_{2R}$ | 4 | 8 | 16 | 32 |
| $H \lvert x_{2I} \rvert$ | 0 | 1 | 3 | 7 |
| $\mathbf{b}_{2I}^T \boldsymbol{\lambda}_{2I}$ | 0 | 2 | 6 | 12 |
| $B x_{2I}^2 + H x_{2I} - \mathbf{b}_{2I}^T \boldsymbol{\lambda}_{2I}$ | 4 | 8 | 16 | 32 |
| $(E \lvert x_{1R} \rvert \pm F \lvert x_{1I} \rvert) \lvert x_{2R} \rvert$ | 2 | 16 | 122 | 936 |
| $f_{2R}(x_{2R} \mid x_1)$ | 8 | 64 | 512 | 4096 |
| $m_{2R} = \min\{f_{2R}(x_{2R} \mid x_1)\}$ | 4 | 48 | 448 | 3840 |
| muxes | 4 | 48 | 448 | 3840 |
| $f_{2I}(x_{2I} \mid x_1)$ | 8 | 64 | 512 | 4096 |
| $m_{2I} = \min\{f_{2I}(x_{2I} \mid x_1)\}$ | 4 | 48 | 448 | 3840 |
| muxes | 4 | 48 | 448 | 3840 |
| $f_1 + m_{2R} + m_{2I}$ | 8 | 32 | 128 | 512 |
| HD solution: $\min\{f_1 + m_{2R} + m_{2I}\}$ | 3 | 15 | 63 | 255 |
| muxes | 3 | 15 | 63 | 255 |
| Soft-output LLRs | 6 | 28 | 118 | 488 |
| muxes | 4 | 24 | 112 | 480 |
| Total (soft-output), adders | 60 | 346 | 2460 | 18290 |
| Total (soft-output), muxes | 12 | 120 | 1008 | 8160 |

Fig. 5. Computation of decision boundaries $R(x_{2R}, \bar{x}_{2R})$

By analogy, it suffices to analyze the complexity of (26) and (28). Since $R(x_{2R}, \bar{x}_{2R}) = R(\bar{x}_{2R}, x_{2R})$, only
$P_2(P_2-1)/2 = 120$ decision boundaries need to be computed (see Fig. 5). The sum $|x_{2R} + \bar{x}_{2R}|$ takes $P_2 - 2$ distinct
non-zero values $(2, 4, \dots, 2P_2 - 4)$, and hence the product term $B|x_{2R} + \bar{x}_{2R}|$ in (26) requires 6 adders. Similarly, the
difference $|x_{2R} - \bar{x}_{2R}|$ takes $P_2 - 1$ distinct non-zero values $(2, 4, \dots, 2P_2 - 2)$. For the division of $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R}$
by these constants, where $\bar{\mathbf{b}}_{2R} = \mathbf{b}(\bar{x}_{2R})$, the term $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R}$ takes 80 distinct values, 40 of which can be
obtained by negation. These 40 values require 22 adders. The required ratios $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R} / (x_{2R} - \bar{x}_{2R})$ take
only 40 distinct values, and require divisions by 3, 5, 7, 9, 11, 13, 15. However, each value of $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R}$
need not be divided by all these 7 constants. By going over all the combinations, it is easy to show that the
number of divisions by the various values of $|x_{2R} - \bar{x}_{2R}|$ is as follows (constant : count):

2 : 4, 4 : 3, 6 : 5, 8 : 2, 10 : 5, 12 : 3, 14 : 4, 16 : 1,

18 : 3, 20 : 2, 22 : 3, 24 : 1, 26 : 2, 28 : 1, 30 : 1

Divisions by powers of 2 are trivial. Division by 3 covers division by $6 = 3 \times 2$, $12 = 3 \times 4$, and $24 = 3 \times 8$, and hence
is needed 9 times. Division by 5 covers division by 10 and 20, and hence is needed 7 times. In a similar fashion,
division by 7 is needed 5 times, by 9 is needed 3 times, by 11 is needed 3 times, by 13 is needed 2 times, and by 15
is needed once. The total number of such non-trivial divisions is 30. The complexity of a division-by-small-constant
circuit is roughly equivalent to a small number of adders for small bit-widths. Specifically, a divide-by-3 is equivalent
to 1 adder; by 5, 7, 9, and 11 are equivalent to 2 adders; and by 13 and 15 are equivalent to 3 adders. Hence, the
ratios in (26) can be computed using 54 adders. Finally, computing all 120 decision boundaries by adding/subtracting
the various 14 non-zero values of $B|x_{2R} + \bar{x}_{2R}|$ to the various 40 distinct ratios $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R} / (x_{2R} - \bar{x}_{2R})$
requires 112 adders ($B|x_{2R} + \bar{x}_{2R}| = 0$ in 8 cases out of the 120).
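The counting in this paragraph can be checked with a few lines of code. The sketch below enumerates the 16-PAM point pairs and tallies the divider cost; the (constant : count) list and the per-divider adder costs are copied from the text rather than derived, and all variable names are illustrative.

```python
from itertools import combinations

P2 = 16
points = range(-(P2 - 1), P2, 2)                 # 16-PAM amplitudes -15, -13, ..., 15
pairs = list(combinations(points, 2))            # unordered pairs of distinct points
zero_sum_pairs = sum(1 for a, b in pairs if a + b == 0)
nonzero_sums = {abs(a + b) for a, b in pairs} - {0}
differences = {abs(a - b) for a, b in pairs}

# (constant : count) list and divider costs as quoted in the text
DIVISIONS = {2: 4, 4: 3, 6: 5, 8: 2, 10: 5, 12: 3, 14: 4, 16: 1,
             18: 3, 20: 2, 22: 3, 24: 1, 26: 2, 28: 1, 30: 1}
ADDERS_PER_DIVIDER = {3: 1, 5: 2, 7: 2, 9: 2, 11: 2, 13: 3, 15: 3}


def odd_part(n):
    """Strip factors of 2, which are free shifts in hardware."""
    while n % 2 == 0:
        n //= 2
    return n


nontrivial = {}
for constant, count in DIVISIONS.items():
    d = odd_part(constant)
    if d > 1:
        nontrivial[d] = nontrivial.get(d, 0) + count

divider_adders = sum(ADDERS_PER_DIVIDER[d] * n for d, n in nontrivial.items())
boundary_adders = len(pairs) - zero_sum_pairs

print(len(pairs), len(nonzero_sums), len(differences))
print(sum(nontrivial.values()), divider_adders, boundary_adders)
```

The tallies reproduce the 120 boundaries, 14 non-zero sums, 15 differences, 30 non-trivial divisions, 54 divider adders, and 112 boundary adders quoted above.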

Moving to (28), a subset of $P_2 - 1$ minimum and $P_2 - 1$ maximum regions must be extracted from these boundaries
for every hypothesis point $x_{2R}$ w.r.t. all other $P_2 - 1$ points in $\mathcal{P}_2$. These can be obtained using a set of $P_2$ comparator
trees, comprising a total of $14 \times 15 = 210$ adders and 210 (2:1)-muxes. Next, $G$ is subtracted from each of the
$P_2 - 1$ min and $P_2 - 1$ max boundaries using 30 adders. Finally, comparisons between $E x_{1R} + F x_{1I}$ and these
min/max boundaries are required to determine $\hat{x}_{2R}$ according to (28). Each comparison requires 30 adders. Only
128 such comparisons are needed for $|E x_{1R} \pm F x_{1I}|$, requiring a total of 3840 adders. The other 128 are derived
by symmetry. Figure 6 shows the architecture of the slicer block in Fig. 4.

Based on the results from the slicers, the $\hat{x}_{2R}$'s are used to evaluate $f_{2R}(\hat{x}_{2R} \mid x_1)$. This is done by selecting
the appropriate multiples $|(E x_{1R} \pm F x_{1I})\hat{x}_{2R}|$ to be added to $B\hat{x}_{2R}^2 + G\hat{x}_{2R} - \mathbf{b}^T(\hat{x}_{2R})\boldsymbol{\lambda}_{2R}$. Hence 256 adders are
needed, in addition to 128 (8:1)-muxes and 256 (16:1)-muxes.

Table IV summarizes the complexity of the slicer-based detector. The architecture requires 11246 adders
and 10372 muxes, which amounts to savings of about 38.5% in adders and an increase of about 27.1% in muxes compared
with the previous architecture using exhaustive search minimization. The internal pipeline registers, output buffers,
and accumulators in Fig. 4 are the same in the two architectures, and thus are not included in the comparison.
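The 16-PAM totals for the slicer-based core can be re-derived from the per-stage counts given in the text. The tally below is a sketch: the dictionary keys are informal labels, the offset subtraction is taken as 30 adders per component as stated in the text, and wide muxes are converted with the $n - 1$ rule for $(n{:}1)$-muxes.

```python
ADDERS_16PAM = {
    "f1(x1)": 256,
    "B|x2R + x2R'| multiples": 6,
    "LLR difference terms (2R and 2I)": 2 * 22,
    "ratios (2R and 2I)": 2 * 54,
    "decision boundaries (2R and 2I)": 2 * 112,
    "min/max extraction (2R and 2I)": 2 * 210,
    "boundary offset subtraction (2R and 2I)": 2 * 30,
    "(E|x1R| +/- F|x1I|)|x2R| multiples": 936,
    "slicer comparisons (2R and 2I)": 2 * 3840,
    "evaluate f2R and f2I": 2 * 256,
    "f1 + f2R + f2I": 512,
    "soft-output LLRs": 488,
}

MUXES_16PAM = {
    "min/max extraction (2R and 2I)": 2 * 210,
    # 128 (8:1)-muxes and 256 (16:1)-muxes per component, in (2:1)-mux equivalents
    "operand selection (2R and 2I)": 2 * (128 * (8 - 1) + 256 * (16 - 1)),
    "soft-output LLRs": 480,
}

slicer_adders = sum(ADDERS_16PAM.values())
slicer_muxes = sum(MUXES_16PAM.values())
search_adders, search_muxes = 18290, 8160      # exhaustive-search totals (Table III)

adder_saving = 100 * (1 - slicer_adders / search_adders)
mux_increase = 100 * (slicer_muxes / search_muxes - 1)
print(slicer_adders, slicer_muxes, round(adder_saving, 1), round(mux_increase, 1))
```

The sums come to 11246 adders and 10372 muxes, i.e. roughly 38.5% fewer adders and 27.1% more muxes than the exhaustive-search core.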


D. Multi-Core Detector Architectures 

Depending on the target throughput and the number of antennas $N$ in the MIMO system, multiple detector cores
similar to Fig. 4 can be configured to build a MIMO detector. Figure 7 shows a 2-sided fully parallel 2×2 MIMO
detector architecture that uses 2 separate cores to detect the two streams. Since the detection algorithm in this case
is optimal, the post-LLR processing stage simply implements (9) and (11), without the need for distance buffering
and accumulation.

Figure 8 shows a 4-sided fully parallel 4×4 MIMO detector that uses 4 cores to process the 4 streams. Here distance
buffering and accumulation are needed before LLR processing in order to adjust the individual LLRs according
to (43). In this case, the WLD matrix inputs for all 4 streams using the decompositions in (34) are needed. If chip
area is the constraining factor, a MIMO detector can be built using a single core that is time-multiplexed among
the 4 streams.

Fig. 6. Block diagram of optimized slicer architecture

Fig. 7. Block diagram of 2-sided MAP detector


VI. Application to MU-MIMO Detection 

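Multi-user MIMO (MU-MIMO) has been proposed as a method for increasing the capacity of wireless networks
[54], [55]. In MU-MIMO, multiple users are scheduled on the same physical resource blocks (PRBs).
Several receiver processing methods have been proposed in the literature for MU-MIMO [55]-[58]. We consider
an optimal MU-MIMO detection method based on the joint constellation estimation of the interfering user and data
detection. The optimal MU-MIMO detector can be efficiently implemented with a slight modification of the MAP
MIMO detector developed in Section III.

TABLE IV

Resources of detector core using slicers (number of adders; mux counts are listed in the rows marked "muxes")

| Term | 2-PAM | 4-PAM | 8-PAM | 16-PAM |
|---|---|---|---|---|
| $f_1(x_1) = f_{1R}(x_{1R}) + f_{1I}(x_{1I})$ | 4 | 16 | 64 | 256 |
| $B \lvert x_{2R} + \bar{x}_{2R} \rvert$ | 0 | 0 | 2 | 6 |
| $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R}$ | 0 | 2 | 8 | 22 |
| $(\mathbf{b}_{2R} - \bar{\mathbf{b}}_{2R})^T \boldsymbol{\lambda}_{2R} / (x_{2R} - \bar{x}_{2R})$ | 0 | 1 | 10 | 54 |
| $R(x_{2R}, \bar{x}_{2R})$ | 0 | 4 | 24 | 112 |
| min/max boundaries | 0 | 6 | 42 | 210 |
| muxes | 0 | 6 | 42 | 210 |
| min/max boundaries $- G$ | 2 | 6 | 14 | 30 |
| $\lvert E x_{1R} \pm F x_{1I} \rvert \cdot \lvert x_{2R} \rvert$ | 2 | 16 | 122 | 936 |
| Compare $\lvert E x_{1R} \pm F x_{1I} \rvert$ and min/max boundaries $- G$ | 4 | 48 | 448 | 3840 |
| $f_{2R}(\hat{x}_{2R} \mid x_1) = \lvert E x_{1R} \pm F x_{1I} \rvert \cdot \lvert \hat{x}_{2R} \rvert + (B\hat{x}_{2R}^2 + G\hat{x}_{2R} - \mathbf{b}^T(\hat{x}_{2R})\boldsymbol{\lambda}_{2R})$ | 4 | 16 | 64 | 256 |
| muxes | 4 | 56 | 544 | 4736 |
| $(\mathbf{b}_{2I} - \bar{\mathbf{b}}_{2I})^T \boldsymbol{\lambda}_{2I}$ | 0 | 2 | 8 | 22 |
| $(\mathbf{b}_{2I} - \bar{\mathbf{b}}_{2I})^T \boldsymbol{\lambda}_{2I} / (x_{2I} - \bar{x}_{2I})$ | 0 | 1 | 10 | 54 |
| $I(x_{2I}, \bar{x}_{2I})$ | 0 | 4 | 24 | 112 |
| min/max boundaries | 0 | 6 | 42 | 210 |
| muxes | 0 | 6 | 42 | 210 |
| min/max boundaries minus offset | 2 | 6 | 14 | 30 |
| Compare $\lvert E x_{1I} \pm F x_{1R} \rvert$ and min/max boundaries minus offset | 4 | 48 | 448 | 3840 |
| $f_{2I}(\hat{x}_{2I} \mid x_1) = \lvert E x_{1I} \pm F x_{1R} \rvert \cdot \lvert \hat{x}_{2I} \rvert + (B\hat{x}_{2I}^2 + H\hat{x}_{2I} - \mathbf{b}^T(\hat{x}_{2I})\boldsymbol{\lambda}_{2I})$ | 4 | 16 | 64 | 256 |
| muxes | 4 | 56 | 544 | 4736 |
| $f_1(x_1) + f_{2R}(\hat{x}_{2R} \mid x_1) + f_{2I}(\hat{x}_{2I} \mid x_1)$ | 4 | 32 | 128 | 512 |
| Soft-output LLRs | 6 | 28 | 118 | 488 |
| muxes | 4 | 24 | 112 | 480 |
| Total adders | 36 | 258 | 1654 | 11246 |
| Total (2:1)-muxes | 12 | 148 | 1284 | 10372 |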

A. MU-MIMO System Model 

We consider a practical OFDM-based MU-MIMO system where 2 users are co-scheduled on the same PRBs,
and each UE has 2 receive antennas. Let $K$ be the number of tones in each PRB. Also let user 1 denote the user of
interest with known constellation $\mathcal{X}_s$, while user 2 denotes the interfering user whose constellation $\mathcal{X}_i$ is unknown
to user 1's receiver. The received frequency-domain complex signal vector $\mathbf{y}[k]$ at the UE of interest on the
$k$th resource element (RE) over which the 2 users are scheduled is given by

$$\mathbf{y}[k] = \mathbf{H}[k]\mathbf{x}[k] + \mathbf{n}[k] = \mathbf{h}_1[k]x_1[k] + \mathbf{h}_2[k]x_2[k] + \mathbf{n}[k], \quad k = 1, \dots, K,$$

where $\mathbf{H}[k] = [\mathbf{h}_1[k]\ \mathbf{h}_2[k]]$ is the complex channel matrix with $\mathbf{h}_1[k]$ and $\mathbf{h}_2[k]$ representing the cascade
of the channel and precoders of user 1 and user 2, respectively; $\mathbf{x}[k] = [x_1[k]\ x_2[k]]^T$ denotes the transmitted 2×1
QAM symbol vector where $x_1[k] \in \mathcal{X}_s$, $x_2[k] \in \mathcal{X}_i$; and $\mathbf{n}[k]$ is the noise vector at the $k$th RE modeled as a
zero-mean complex Gaussian random vector with variance $\sigma^2$.

Fig. 8. Block diagram of 4-sided MAP detector


B. ML MU-MIMO Detection 

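The maximum likelihood estimate of the constellation of the interfering user based on $\mathbf{y}[1], \dots, \mathbf{y}[K]$ is given
by

$$\hat{\mathcal{X}}_i = \arg\max_{\mathcal{X}_i \in \mathcal{M}} \; p\big(\{\mathbf{y}[k]\}_{k=1}^{K} \mid \{\mathbf{H}[k]\}_{k=1}^{K}, \mathcal{X}_s, \mathcal{X}_i\big),$$

where $K$ is the number of REs over which $\mathcal{X}_i$ is constant, and

$$\mathcal{M} = \{\text{4-QAM}, \text{16-QAM}, \text{64-QAM}, \text{256-QAM}\}$$

denotes the set of allowable constellations for the interferer. Assuming that $x_1[k]$, $x_2[k]$, and $\mathbf{n}[k]$ are independent for
all $k = 1, \dots, K$, the ML estimate of the interferer's constellation can then be written as

$$\hat{\mathcal{X}}_i = \arg\max_{\mathcal{X}_i \in \mathcal{M}} \frac{1}{|\mathcal{X}_i|^K} \prod_{k=1}^{K} \sum_{x_1[k] \in \mathcal{X}_s} \sum_{x_2[k] \in \mathcal{X}_i} p\big(\mathbf{y}[k] \mid \mathbf{H}[k], \mathcal{X}_s, \mathcal{X}_i, x_1[k], x_2[k]\big), \quad (44)$$

where $|\mathcal{X}_i|$ denotes the size of the interfering user's constellation, under the assumption that
$P(x_1[k]) = 1/|\mathcal{X}_s|$ and $P(x_2[k]) = 1/|\mathcal{X}_i|$, $k = 1, \dots, K$.

Letting $d(\mathbf{x}[k]) = \|\mathbf{y}[k] - \mathbf{H}[k]\mathbf{x}[k]\|^2 / \sigma^2$, we can then write (44) as

$$\hat{\mathcal{X}}_i = \arg\max_{\mathcal{X}_i \in \mathcal{M}} \frac{1}{|\mathcal{X}_i|^K} \prod_{k=1}^{K} \sum_{\mathbf{x}[k] \in \mathcal{X}_s \times \mathcal{X}_i} \exp\big(-d(\mathbf{x}[k])\big).$$

Using the log-max approximation [59], we can approximate the ML estimate $\hat{\mathcal{X}}_i$ by [60]

$$\hat{\mathcal{X}}_i \approx \arg\min_{\mathcal{X}_i \in \mathcal{M}} \Big\{ K \log(|\mathcal{X}_i|) + \sum_{k=1}^{K} \min_{\mathbf{x}[k] \in \mathcal{X}_s \times \mathcal{X}_i} d(\mathbf{x}[k]) \Big\}, \quad (45)$$

where $\log(\cdot)$ is the natural logarithmic function.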
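To make the log-max rule in (45) concrete, the following sketch classifies the interferer's constellation by brute force: for each hypothesis it accumulates the per-RE minimum distance over $K$ REs and adds the $K \log|\mathcal{X}_i|$ penalty. It is a software illustration only, not the hardware data path. The function names are hypothetical, unit-average-energy square QAM is an assumed normalization, and the channel is passed as a 2×2 nested list.

```python
import math


def qam(order):
    """Square QAM with unit average energy (normalisation is an assumption)."""
    m = math.isqrt(order)
    levels = [2 * i - (m - 1) for i in range(m)]
    scale = math.sqrt(2 * (order - 1) / 3)
    return [complex(a, b) / scale for a in levels for b in levels]


def min_distance(y, H, Xs, Xi, sigma2):
    """min over (x1, x2) in Xs x Xi of ||y - H x||^2 / sigma2 for one RE."""
    best = math.inf
    for x1 in Xs:
        r0 = y[0] - H[0][0] * x1
        r1 = y[1] - H[1][0] * x1
        for x2 in Xi:
            d = abs(r0 - H[0][1] * x2) ** 2 + abs(r1 - H[1][1] * x2) ** 2
            best = min(best, d)
    return best / sigma2


def classify(ys, Hs, Xs, candidate_orders, sigma2):
    """Log-max rule: argmin of K*log|Xi| + sum_k min d(x[k])."""
    K = len(ys)
    costs = {}
    for order in candidate_orders:
        Xi = qam(order)
        costs[order] = K * math.log(order) + sum(
            min_distance(y, H, Xs, Xi, sigma2) for y, H in zip(ys, Hs))
    return min(costs, key=costs.get), costs


# Noise-free toy example: 4-QAM desired user, 16-QAM interferer, K = 4 REs.
H = [[1.0, 0.4 + 0.3j], [0.2 - 0.5j, 0.9]]
Xs, Xi_true = qam(4), qam(16)
ys = []
for k in range(4):
    x1, x2 = Xs[k % 4], Xi_true[(5 * k + 3) % 16]
    ys.append([H[0][0] * x1 + H[0][1] * x2, H[1][0] * x1 + H[1][1] * x2])

estimate, costs = classify(ys, [H] * 4, Xs, (4, 16, 64, 256), sigma2=1e-3)
print(estimate)
```

In this noise-free toy case the 16-QAM hypothesis attains a zero accumulated distance and is selected; larger hypotheses are penalized by the $K \log|\mathcal{X}_i|$ term, while the 4-QAM hypothesis incurs a large mismatch distance.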

Once the co-scheduled user's constellation, $\hat{\mathcal{X}}_i$, is estimated, the LLR of the $j$th bit of the desired user's QAM
symbol $x_1[k]$ on the $k$th RE is given by [44]

~ min fi(x[fc]) - min d{x[k]), (46) 

xi[k]GP(^Qj xi[k]GP(^Qj 

X2[k]G^i X2[k]G^i 

where $\mathcal{X}_{s,j}^{(+1)} = \{x \in \mathcal{X}_s : b_j = +1\}$ and $\mathcal{X}_{s,j}^{(-1)} = \{x \in \mathcal{X}_s : b_j = -1\}$. As seen from (46), computing the LLRs
involves the same distance computations as those needed for the co-scheduled user's constellation estimation in
(45). This fact is exploited in the architecture of a joint constellation classifier and data MU-MIMO detector shown
in Fig. 9, which uses an optimized one-sided MAP MIMO detector as its core. The MIMO detector processes the
received signal $\mathbf{y}[k]$ assuming all 4 possible choices of the interferer's constellation. It generates 4 corresponding lists
of minimum distance metrics $d(\mathbf{x}[k])$ and their associated symbol vectors $\mathbf{x}[k]$ for all the $|\mathcal{M}|$ possible hypotheses
of the interferer's constellation, with $x_1[k] \in \mathcal{X}_s$. These distances and symbols are stored in 4 buffers each of size
$|\mathcal{X}_s|$ as shown in Fig. 9.

For each tone, the minimum distance from each list is passed to an adder that accumulates the minimum distances 
over a span of K tones, during which the interferer modulation is assumed to be static. The resulting 4 minimum 
accumulated distances for each interferer hypothesis are stored in a buffer. The minimum from this buffer is used 
to identify the interferer’s constellation, and the corresponding stored distances in the buffers are selected and 
forwarded for LLR processing according to (46). 

Note that since the interferer's modulation constellation remains static over $K$ tones for a duration of 1 subframe
in LTE (14 OFDM symbols), the particular choice of $K = 12$ results in substantial savings in computations. The
detector only needs to run in the above mode to identify the interferer's constellation for one OFDM symbol in the
subframe. It can then switch back to normal ML detection mode (without modulation classification) to generate the
LLRs for the remaining 13 OFDM symbols for the user of interest $x_1[k]$.

Fig. 9. Block diagram of a MU-MIMO detector

Taking the LTE scenario for hardware complexity analysis, the total number of possible tones in 1 PRB in a
subframe is $12 \times 14 = 168$. Of these tones, 28 are reserved for pilots (for cell-specific reference signals and for
UE-specific pilots to support the MU-MIMO transmission mode), and 140 for data. In the hardware architecture
of Fig. 9, the total number of distance computations needed to generate the LLRs from these 140 data tones is
$(140 + 12 \times 5) \times |\mathcal{X}_s|$. This corresponds to an increase of only 42.86% compared to the distances computed by an
ML detector with perfect knowledge of the interferer.
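The overhead figure follows from simple bookkeeping on the tone counts quoted above. The sketch below uses illustrative names and takes the $12 \times 5$ extra term directly from the text.

```python
TONES_PER_PRB = 12 * 14            # subcarriers x OFDM symbols in one subframe
PILOT_TONES = 28                   # cell-specific and UE-specific pilots
DATA_TONES = TONES_PER_PRB - PILOT_TONES
EXTRA_CLASSIFICATION_TERM = 12 * 5  # additional term quoted in the text


def distance_evaluations(xs_size):
    """Distance computations without and with interferer classification."""
    baseline = DATA_TONES * xs_size
    joint = (DATA_TONES + EXTRA_CLASSIFICATION_TERM) * xs_size
    overhead_percent = 100 * (joint - baseline) / baseline
    return baseline, joint, overhead_percent


print(DATA_TONES, distance_evaluations(64))
```

The relative overhead is $60/140 \approx 42.86\%$ and does not depend on the size of the desired user's constellation.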

Figure 10 shows the results when $\mathcal{X}_s$ is 64-QAM, with $\mathcal{X}_i$ being 4-, 16-, and 64-QAM using $K = 24$ resource
elements. The plots show that the ML classification method has a 5 dB gain over the basic nulling approach when
$\mathcal{X}_s$ is 4-QAM, and a 2 dB gain in the case of $\mathcal{X}_s$ being 64-QAM. Therefore, the gain of the ML classification method
is largest for small constellation sizes of the desired signal, i.e., the largest gain is attained when the receiver
complexity is minimal.

Figure 11 shows the performance of the joint ML classification and detection method as compared to an ML
receiver that has perfect knowledge of the interfering user's constellation. Also shown in the figure is the performance
of the linear MMSE receiver that only uses the knowledge of the interfering user's channel and does not exploit
knowledge of the interferer's constellation. Both users use 64-QAM, with the turbo code of [61] and encoding
rate 1/2 using block size 6144 bits. The pedestrian-A (Ped-A) [62] multi-path fading channel with high antenna
correlation was used. The effective channel matrix is given by $\mathbf{H} = \mathbf{R}_t^{1/2}\mathbf{H}_c\mathbf{R}_r^{1/2}$, where $\mathbf{H}_c$ is the channel whose
entries are uncorrelated and generated according to the Ped-A model, and $\mathbf{R}_t$ and $\mathbf{R}_r$ are the transmit and receive
antenna 2×2 correlation matrices, respectively, which have 1 on the diagonal entries and 0.9 on the off-diagonal entries.
As seen from Fig. 11, the joint ML classification and detection receiver is only 0.1 dB away from an ML receiver
that has perfect knowledge of the co-scheduled user's constellation. The MMSE method has a significant performance
degradation as compared to the joint ML classification and detection receiver.

Fig. 10. Probability of correct interferer modulation constellation detection versus SNR [60]. Solid lines are for the nulling approach, dashed lines are for
the ML approach. The desired user constellation is fixed to 64-QAM and the co-scheduled user constellation is 4-, 16-, and 64-QAM. The channel
is i.i.d. block fading.

VII. Implementation and Simulation Results 

The proposed 2x2 reconfigurable MIMO detector architecture was modeled in VHDL and synthesized on a 
Xilinx Virtex®-6 FPGA. The core was also synthesized using a 90 nm CMOS ASIC library. The experimental 
simulations below evaluate the coded bit-error rate (BER) performance of the proposed detection algorithm and the 
implemented core, assuming a MIMO system employing either 2 transmit and 2 receive antennas, or 4 transmit 
and 4 receive antennas. The channel encoder is based on the LTE turbo encoder specification [2] with interleaver 
length 1024, using 16-QAM, 64-QAM, and 256-QAM modulation constellations. The channel entries are assumed 
to be i.i.d. complex Gaussian random variables with unit variance. At the receiver end, we assume perfect channel 
knowledge. The turbo decoder implements the true A Posteriori Probability algorithm, and performs 4 full decoding 
iterations. Also, the detector and turbo decoder perform up to 4 outer joint detection and decoding iterations. Channel 
decomposition is performed externally by a pre-processing stage and the coefficients in (18)-(19) are fed as input. 

A. Performance Results 

Fig. 11. BLER versus per-tone per-antenna SNR (dB). EPA channel, high correlation (0.9), 64-QAM for both users and code-rate 1/2.

Fig. 12. BER vs. outer detection-decoding iterations for various bit-precisions at SNR = 14 dB for a 2×2 MIMO system with 64-QAM.

Fig. 13. BER vs. SNR for various bit-precisions and up to 4 outer detection-decoding iterations, for a 2×2 MIMO system with 64-QAM.

The bit-precision of the detector architecture can be configured to enable tradeoff analysis between gate complexity
and tolerable degradation in BER performance due to quantization noise. Figure 12 compares the BER performance
of the detector core for 2 layers and 64-QAM under various integer and fractional bit-widths, versus floating-point
performance, at SNR = 14 dB. The x-axis denotes the number of joint detection and decoding iterations. The
top figure corresponds to a fixed-point representation of (I.F) = {8.6, 8.7, 8.8, 8.9}, where I denotes the integer bit-
precision while F denotes the fractional bit-precision. The bottom figure corresponds to the representation of (I.F) =
{9.6, 9.7, 9.8, 9.9}. As can be seen, when F drops to 6, the BER starts to degrade. There is no significant
improvement in BER in going beyond I = 9 integer bits, as demonstrated also in Fig. 13.
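As a rough illustration of what an (I.F) fixed-point format implies for a single value, the sketch below quantizes a real number to a signed grid with F fractional bits and saturates at the range set by I. The function is hypothetical: whether the paper's I includes the sign bit, and which rounding and overflow modes the core uses, are not stated in the text and are assumptions here.

```python
def quantize(value, int_bits, frac_bits):
    """Quantize to a signed (I.F) grid with saturation.

    Assumptions: int_bits includes the sign bit, ties follow Python's
    round() (round-half-to-even), and overflow saturates.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale
    hi = (1 << (int_bits - 1)) * scale - 1
    code = round(value * scale)
    code = max(lo, min(hi, code))
    return code / scale


for frac_bits in (6, 7, 8, 9):
    print(frac_bits, quantize(0.3, 9, frac_bits))
```

Under these assumptions the worst-case rounding error for in-range values is $2^{-(F+1)}$, so each added fractional bit halves the quantization step.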

Figure 14 compares the BER performance of the core using 16-QAM, 64-QAM, and 256-QAM. The plots
demonstrate that most of the coding gain is attained after 3 outer iterations, assuming the inner turbo decoder
performs at most 4 full turbo decoding iterations.

In Figs. 15 and 16, the BER performance of a 4×4 MIMO system using the proposed WLD scheme is simulated.
In Fig. 15, the plots compare the BER versus SNR of the proposed WLD scheme with E = 1 and 2 structures
(Fig. 3a-3b), versus ML, zero-forcing (ZF), the approach of [47], and the sphere decoder with radius clipping [30], for
16-QAM. Both overlapping and non-overlapping subsets are considered. Two scenarios for distance computations
in (32) are followed: one based on H and one on L. The plots demonstrate that WLD with E = 2 using H
distances with overlapping subsets performs virtually as ML, and is less than 0.1 dB away from ML with no
overlapping. Also, for single streams, L distances perform better than H distances. The plots correspond to one
outer detection-decoding iteration, and 4 full internal turbo decoder iterations.

Figure 16 compares the BER performance for 64-QAM. The plots demonstrate again that the WLD scheme with
E = 2 using H distances and overlapping subsets performs very close to ML. Figure 17 shows the results for
256-QAM.

Fig. 14. BER vs. SNR for a 2×2 MIMO system with 16-, 64-, and 256-QAM.

Fig. 15. BER vs. SNR plots for a 4×4 MIMO system with 16-QAM.

Fig. 16. BER vs. SNR plots for a 4×4 MIMO system with 64-QAM.

Fig. 17. BER vs. SNR plots for a 4×4 MIMO system with 256-QAM.

Fig. 18. Hardware complexity of various synthesized detector cores.

B. Architecture Synthesis Results 

Various architecture configurations for the 2x2 core with different algorithmic features and architectural 
optimizations were synthesized, assuming 17-bit datapaths. The datapaths are pipelined with 6 stages and clocked 
at 275 MHz. The input LLRs fed from the turbo decoder are 8 bits wide. The output LLRs from the detector are 
passed to a dynamic scaling block (not included in this work) that scales the bit-widths down to 8 bits before 
feeding them to the turbo decoder. 

Figure 18 shows the gate complexity of 8 different architectures. Four architectures support reconfigurable
constellations up to 64-QAM, while the other four support up to 256-QAM. For the 64-QAM case, two architectures
are designed to support soft-outputs only without soft-inputs (i.e., ML detection, see Section III-B): one based
on distance minimizations using exhaustive search (Section V-B), and one based on minimization via slicing
(Section V-C). The other two 64-QAM architectures support both soft-outputs and soft-inputs (i.e., MAP detection,
see Section III-C), one with minimization based on exhaustive search and one via slicing. The four 256-QAM
architectures are similar. All architectures have the same input/output interfaces, external buffers, and control logic.
The reported gate counts in gate-equivalents (GE) are for the core logic only.

The plots demonstrate that there is a significant increase in complexity (between 6.35x-6.82x) when supporting
256-QAM compared to 64-QAM. Furthermore, the slicer-based architectures using the proposed scheme in Section
V-C offer a significant reduction in complexity compared to distance minimization by search (between 19.58%-
26.22% for 64-QAM, and between 24.28%-30.35% for 256-QAM). Finally, for slicer-based architectures, supporting
soft-inputs for MAP detection comes with an increase in gate count between 8.49%-9.83% compared to soft-output-
only ML detection. For minimization-by-search architectures, the overhead of supporting soft-inputs is only between
0.51%-1.71%. The gate counts predicted by the theoretical analyses in Section V are also plotted in Fig. 18. The
error ranges between 8%-11%, which supports the validity of the model used and the theoretical analysis performed.

Fig. 19. Hardware complexity as a function of bit-width.

Figure 19 plots the gate complexity of the slicer-based MAP cores as a function of bit-width. The complexity 
increases roughly between 5.2%-5.9% for every added bit. A similar trend was observed when synthesizing the 
256-QAM core with soft-outputs only on a Virtex-6 FPGA. The area increases from 317937 LUTs (33%) for 18 
bits to 337210 LUTs (35%) for 19 bits. The area jumps to 403498 LUTs (42%) when the integer bit-width is 
increased to 12 bits. 

The core achieves an average SNR-independent throughput of 2.2 Gbps for 2 layers with 256-QAM, when running
in soft-input soft-output mode. In 4×4 mode, the core achieves a throughput of 733 Mbps and consumes 320.56 mW
of power. This compares favorably with other detectors in the literature, which report 757 Mbps at
410 kGE [11]; 772 Mbps at 212 kGE [33]; 1.2 Gbps at 1097 kGE for 16-QAM [35]; and 2.2 Gbps at 555 kGE [36]
for up to 64-QAM only. Table V provides a comparative summary of our implemented detector and the detectors
in [11], [33], [35], [36].
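The normalized figures of merit used in such comparisons are plain ratios of the reported numbers. The sketch below recomputes them from the area, throughput, and power values quoted in the text and in Table V; the dictionary layout and function names are illustrative.

```python
DETECTORS = {
    "This work (2x2)": {"area_kge": 1580, "mbps": 2200, "mw": None},
    "This work (4x4)": {"area_kge": 1580, "mbps": 733, "mw": 320.56},
    "[11]": {"area_kge": 410, "mbps": 757, "mw": 189.1},
    "[33]": {"area_kge": 212, "mbps": 772, "mw": 87.62},
    "[35]": {"area_kge": 1097, "mbps": 1200, "mw": None},
    "[36]": {"area_kge": 555, "mbps": 2200, "mw": 335.8},
}


def hardware_efficiency(entry):
    """Area per unit throughput in kGE/Mbps (lower is better)."""
    return entry["area_kge"] / entry["mbps"]


def energy_per_bit_nj(entry):
    """Energy per bit in nJ/bit: mW divided by Mbps; None if power is not reported."""
    if entry["mw"] is None:
        return None
    return entry["mw"] / entry["mbps"]


for name, entry in DETECTORS.items():
    print(name, round(hardware_efficiency(entry), 2), energy_per_bit_nj(entry))
```

With these inputs the ratios agree with the tabulated values to two decimals, except that the area-per-throughput ratio for [33] evaluates to about 0.27 rather than the tabulated 0.28.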


VIII. Conclusions 

A configurable 2-layer soft-input soft-output MIMO detector core has been proposed as a basic building block for 
constructing detectors with more spatial streams. Optimizations targeting distance computations and slicing 
operations reduce the overall complexity when supporting constellations up to 256-QAM. By appropriately 
decomposing the MIMO channel, multi-layer detection is cast in terms of multiple parallel 2-layer detection 
problems, which can be mapped onto the 2-layer core. Various architectures have been developed to achieve a high 
target detection throughput. The proposed core has also been applied to the design of an optimal MU-MIMO detector 
for LTE. The core occupies an area of 1.58 MGE and achieves a throughput of 733 Mbps with 320.56 mW of power 
for 256-QAM when synthesized in 90 nm CMOS. Future work will target expanding the core to handle 1024-QAM. 
TABLE V 

Summary and comparison of implementation results 

Reference                    This work      [11]           [33]           [35]             [36] 
----------------------------------------------------------------------------------------------------------- 
Antennas                     ≤4x4           ≤4x4           ≤4x4           ≤4x4             4x4 
Modulation [QAM]             ≤256           ≤64            ≤64            16               64 
Algorithm                    WLD            MMSE-PIC       STS-SD         Trellis search   FSD 
Iterative                    YES            YES            YES            YES              YES 
Technology [nm]              90             90             90             65               90 
Core area [kGE] (a)          1580           410 (b)        212            1097             555 
Clock freq. [MHz]            275            568            193            320              370 
Maximum throughput [Mbps]    2200 (2x2)     757            772            1200 (c)         2200 
                             733 (4x4) 
Normalized hardware          0.72 (2x2)     0.54           0.28           0.91             0.25 
efficiency [kGE/Mbps]        2.16 (4x4) 
Power consumption            320.56 @ 733   189.1 @ 757    87.62 @ 772    -                335.8 @ 2200 
in [mW] @ [Mbps] 
Energy efficiency [nJ/bit]   0.44           0.25           0.11           -                0.15 

(a) One gate-equivalent corresponds to a 2-input drive-1 NAND gate. 

(b) Includes preprocessing circuitry. 

(c) Technology scaling to 90 nm CMOS technology according to A ~ 1/s^2, t_pd ~ 1/s, and 
P_dyn ~ (1/s)(V_dd/V'_dd)^2 [11]. 